Hippo get together presentation solr integration

By Ard Schrijvers

    Presentation Transcript

    • Solr integration
      April 20, 2012
      Ard Schrijvers (a.schrijvers@onehippo.com / ard@apache.org)
    • About me: Ard Schrijvers
      1. Working at Hippo since 2001
      2. Email: a.schrijvers@onehippo.com, ard@apache.org
      3. Worked primarily on:
         1. HST
         2. Hippo Repository / Jackrabbit
         3. Lucene
         4. Cocoon
         5. Slide
      4. Apache committer of Jackrabbit and Cocoon
    • Outline
      1. The current search (HST / repo) architecture
      2. The current problems / shortcomings / mismatches
      3. What we are trying to improve, the objectives
      4. Solr integration to rescue
      5. A very fast demo
      6. Wrap up
      7. Questions
    • Current search architecture
      An HSTQuery is translated to an XPath query, which is delegated to the repository. The repository returns a JCR NodeIterator, which the HST binds back to HippoBeans.
    • Current search architecture
      That sounds doable and not too complex, doesn't it? Well, it is ... very complex.
    • Current search architecture
      Reasons:
      1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
      2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
      3. Support for XPath / SQL was needed. However, Lucene likes flattened data, while JCR with XPath / SQL is all about hierarchical data
      4. JCR Nodes != Documents
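      To make the mismatch between hierarchical queries and flat Lucene documents more concrete, the sketch below builds the kind of JCR XPath statement that a full-text search on one document type below a folder could end up as. It is an illustration only: the class XPathSketch, its methods, and the example folder path are assumptions, not the HST's actual translation code.

```java
// Hypothetical helper, not part of the HST: builds a JCR XPath statement as a String.
class XPathSketch {

    // Escapes single quotes by doubling them, so the text stays inside the quoted literal.
    static String escape(String text) {
        return text.replace("'", "''");
    }

    // scopePath: absolute folder path, nodeType: primary node type, text: free text to search for.
    static String fullTextQuery(String scopePath, String nodeType, String text) {
        return "/jcr:root" + scopePath
                + "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escape(text) + "')]";
    }

    public static void main(String[] args) {
        // The folder path is made up; the node type is the one used in the bean examples.
        System.out.println(fullTextQuery("/content/documents", "demosite:textdocument", "solr"));
    }
}
```

      Both the path step and the node type test constrain the node hierarchy, which is exactly the part a flat Lucene index has to emulate.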
    • Outline
      1. The current search (HST / repo) architecture
      2. The current problems / shortcomings / mismatches
      3. What we are trying to improve, the objectives
      4. Solr integration to rescue
      5. A short HOWTO as developer
      6. A very fast demo
      7. Wrap up
      8. Questions
    • Current problems / shortcomings / mismatches
      1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
      2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
      3. Very hard and very limited to customize
      4. A single index for an entire workspace
      5. Support for very complex XPath / SQL queries at a price of CPU, memory and complexity
      6. Only JCR Nodes and properties are indexed: no derived field indexes
      7. To index external sources, the sources need to be stored in the repository
      8. Range queries (and others) easily blow up
      9. Getting the number of hits is complex
    • Current problems / shortcomings / mismatches
      Extra problem: JCR Nodes != Documents
      For example: a news document contains a link to an author document. The news document should be found through the author name.
    • Objectives
      1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
      2. Easy to use and customize
      3. Satisfied customers
      4. Satisfied partners
      5. Scalable searches: CPU, memory and large document numbers
      6. Document oriented
      7. Integration with HST ContentBeans (HippoBeans)
      8. Index external sources
      9. Control the SIZE of the index yourself
      10. Don't invent but integrate (with out-of-the-box features supported by a large community)
    • Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides
      Easy: Solr integration to the rescue
    • Objective: Easy to use and customize
      YOU will be in the driver's seat
    • Objective: Easy to use and customize
      No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU
    • Objective: Easy to use and customize
      You decide from where, what, how and when to index
      1. from where: which sources (jcr, webpages, database, noSQL store, nuxeo, alfresco, anything)
      2. what: which parts of a document (not jcr node) or external source
      3. how:
         1. which analyzer
         2. index on document level, property level or both
         3. store the text
      4. when: when do you want to index
    • Objective: Easy to use and customize
      But of course, out-of-the-box support and tooling ready to be used by YOU
      1. Default hippo repository indexer & observer
      2. ContentBean (HippoBean) annotations for indexing
      3. Binding search results to ContentBeans
      4. Deployment support
      5. Clustering support
    • Objective: Satisfied customers
      HOW? EASY: most likely they will just be satisfied
    • Objective: Satisfied customers
      If they are not satisfied enough you can:
      1. Easily customize it (aka tune it until you are blue in the face)
      2. Hire anyone with Solr experience: all our partners have Solr experience
    • Objective: Satisfied customers
      Still not satisfied? Let them pay too much for a Google Search Appliance, Autonomy or any other software that is not worth paying for
    • Objective: Satisfied partners
      Although on thin ice here, I strongly believe in this because:
      1. Our partners frequently have good knowledge about Solr
      2. Our partners depend less on the current search limitations
      3. Our partners can pitch with their Solr knowledge
      4. Our partners can sell more Hippo implementations
      5. Our partners will earn more on Hippo and have happier developers
      6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features
    • Objective: Scalable searches
      1. Using Solr to do the searches
      2. Not the complex JCR hierarchical searches
      3. Document oriented instead of JCR Nodes (#docs << #nodes)
    • Objective: Document oriented
      What do we want to search for? Exactly, Documents!!
      A Document == A HippoBean != JCR Node
      So let's index HippoBeans (ContentBeans)
    • Objective: Integration with ContentBeans (HippoBeans)
      As a developer: how am I going to index my beans?
      I know how to write HippoBeans, that's all I ever did in my life.
      How do you expect me to index my beans?
    • Objective: Integration with ContentBeans (HippoBeans)
      Annotate your getters with @IndexField or @IndexField(name="foo"), and account for them in the Solr schema.xml:
        <field name="title" type="text_general" indexed="true" stored="true" />
        <field name="summary" type="text_general" indexed="true" stored="true" />
    • Objective: Integration with ContentBeans (HippoBeans)
      An example:
        @Node(jcrType="demosite:textdocument")
        public class TextBean extends BaseDocument {

            // Indexed without an explicit field name
            @IndexField
            public String getTitle() {
                return getProperty("demosite:title");
            }

            // Indexed under the explicit field name "samenvatting" (Dutch for "summary")
            @IndexField(name="samenvatting")
            public String getSummary() {
                return getProperty("demosite:summary");
            }
        }
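      The slides do not show how the annotated getters end up as index fields. As a rough sketch of the general idea only, the code below reads an annotation reflectively and collects field names and values. The IndexField annotation here is a local stand-in with the same name, SketchTextBean returns hard-coded values instead of JCR properties, and the rule for deriving a default field name from the getter is an assumption; none of this is the real HST implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for the HST annotation; NOT the real Hippo class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface IndexField {
    String name() default "";
}

// Plain bean standing in for a HippoBean; values are hard-coded instead of read from JCR.
class SketchTextBean {
    @IndexField
    public String getTitle() { return "Hello Solr"; }

    @IndexField(name = "samenvatting")
    public String getSummary() { return "A short summary"; }

    // Not annotated, so it is not collected.
    public String getInternalNote() { return "ignore me"; }
}

class IndexFieldSketch {

    // Collects field name -> value for every no-argument getter annotated with @IndexField.
    static Map<String, Object> toIndexFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            IndexField ann = m.getAnnotation(IndexField.class);
            String getter = m.getName();
            if (ann == null || !getter.startsWith("get") || getter.length() <= 3
                    || m.getParameterCount() != 0) {
                continue;
            }
            // Assumed default: getTitle -> "title" when the annotation has no name.
            String derived = Character.toLowerCase(getter.charAt(3)) + getter.substring(4);
            String name = ann.name().isEmpty() ? derived : ann.name();
            try {
                fields.put(name, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + getter, e);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields(new SketchTextBean()));
    }
}
```

      Whatever field names come out of such a binding, each one needs a matching field in schema.xml, as the previous slide points out.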
    • Objective: Integration with ContentBeans (HippoBeans)
      Another example:
        @Node(jcrType="demosite:textdocument")
        public class TextBean extends BaseDocument {

            @IndexField
            public String getTitle() {
                return getProperty("demosite:title");
            }

            @IndexField
            public String getSummary() {
                return getProperty("demosite:summary");
            }

            // Derived field: the author name comes from the linked author document
            @IndexField
            public String getAuthor() {
                return getLinkedBean("demosite:author", Author.class).getAuthor();
            }
        }
    • Objective: Integration with ContentBeans (HippoBeans)
      Another example:
        @Node(jcrType="demosite:textdocument")
        public class TextBean extends BaseDocument {

            @IndexField
            public String getTitle() {
                return getProperty("demosite:title");
            }

            @IndexField
            public String getSummary() {
                return getProperty("demosite:summary");
            }

            // Returns the linked Author bean itself instead of a String
            @ReIndexOnChange
            @IndexField
            public Author getAuthor() {
                return getLinkedBean("demosite:author", Author.class);
            }
        }
    • Objective: Integration with ContentBeans (HippoBeans)
      Another example: Setters
        @Node(jcrType="demosite:textdocument")
        public class TextBean extends BaseDocument {

            private String title;
            private String summary;

            // Falls back to the JCR property when no title has been set
            @IndexField
            public String getTitle() {
                return title == null ? getProperty("demosite:title") : title;
            }

            public void setTitle(String title) {
                this.title = title;
            }

            // Falls back to the JCR property when no summary has been set
            @IndexField
            public String getSummary() {
                return summary == null ? getProperty("demosite:summary") : summary;
            }

            public void setSummary(String summary) {
                this.summary = summary;
            }
        }
      Bonus: What can we achieve with the Setters?
    • Objective: Integration with ContentBeans (HippoBeans)
      That's all you need to do. The HST also binds some extra indexing fields, like:
      1. The path
      2. The canonicalUUID
      3. The name
      4. The localized name
      5. The depth
      6. The class hierarchy (including interfaces)
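      How the HST derives these extra fields is not shown in the slides. As a rough sketch only, two of them (the class hierarchy including interfaces, and the depth) could be computed as below. All class and method names are hypothetical, and counting path segments is just one plausible definition of depth.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-ins for a small bean hierarchy; not the real HST types.
interface SketchContentBean {}
class SketchBaseDocument implements SketchContentBean {}
class SketchNewsBean extends SketchBaseDocument {}

class ExtraFieldsSketch {

    // Names of the class, its superclasses (excluding Object) and all implemented interfaces.
    static Set<String> classHierarchy(Class<?> type) {
        Set<String> names = new LinkedHashSet<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            names.add(c.getName());
            collectInterfaces(c, names);
        }
        return names;
    }

    // Adds the interfaces of c, including the interfaces those interfaces extend.
    private static void collectInterfaces(Class<?> c, Set<String> names) {
        for (Class<?> i : c.getInterfaces()) {
            names.add(i.getName());
            collectInterfaces(i, names);
        }
    }

    // Assumed definition of depth: the number of non-empty path segments.
    static int depth(String path) {
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println(classHierarchy(SketchNewsBean.class));
        System.out.println(depth("/content/documents/news/article"));
    }
}
```

      Indexing every type name in the hierarchy would let a query be restricted to "all beans of this class or interface" with a plain term match, which is presumably why such a field is useful.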
    • Objective: Index external sources
      You can:
      1. Push them directly to Solr
      2. Push them to a HST JAX-RS resource that binds to a ContentBean and commits to Solr
      3. Crawl from the HST and bind to ContentBeans and commit them to Solr
    • Objective: Index external sources
      A ContentBean does *not* need a JCR Node!
      ContentBean interface:
        public interface ContentBean {

            // The path is indexed as the "id" field
            @IndexField(name="id")
            String getPath();

            void setPath(String path);
        }
    • Objective: Index external sources
      An example: GoGreenProductBean in Testsuite
        public class GoGreenProductBean implements ContentBean {

            private String path;
            private String title;
            private String summary;
            private String description;

            // Inherits @IndexField(name="id") from the ContentBean interface
            public String getPath() { return path; }
            public void setPath(final String path) { this.path = path; }

            @IndexField
            public String getTitle() { return title; }
            public void setTitle(String title) { this.title = title; }

            @IndexField
            public String getSummary() { return summary; }
            public void setSummary(String summary) { this.summary = summary; }

            @IndexField
            public String getDescription() { return description; }
            public void setDescription(String description) { this.description = description; }
        }
    • Objective: Index external sources
      And add the GoGreenProductBean to Solr:
        {
            List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
            // FILL THE gogreenBeans LIST

            // NOW ADD TO INDEX
            HippoSolrManager solrManager = HstServices.getComponentManager().getComponent(
                    HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
            try {
                solrManager.getSolrServer().addBeans(gogreenBeans);
                UpdateResponse commit = solrManager.getSolrServer().commit();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (SolrServerException e) {
                e.printStackTrace();
            }
        }
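      The "FILL THE gogreenBeans LIST" step depends entirely on the external source. A minimal sketch, assuming the source delivers rows as key/value maps: SketchProductBean is a trimmed local stand-in for GoGreenProductBean, and the "sku" and "title" keys as well as the id prefix are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Trimmed local stand-in for GoGreenProductBean; no HST or Solr classes involved.
class SketchProductBean {
    private String path;
    private String title;

    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
}

class ExternalSourceSketch {

    // Binds external rows to beans; the path doubles as the unique id of the indexed document.
    static List<SketchProductBean> toBeans(List<Map<String, String>> rows, String idPrefix) {
        List<SketchProductBean> beans = new ArrayList<>();
        for (Map<String, String> row : rows) {
            String sku = row.get("sku"); // hypothetical key holding a stable identifier
            if (sku == null || sku.isEmpty()) {
                continue; // without a stable id the document could not be updated later
            }
            SketchProductBean bean = new SketchProductBean();
            bean.setPath(idPrefix + sku);
            bean.setTitle(row.get("title"));
            beans.add(bean);
        }
        return beans;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
                Map.of("sku", "42", "title", "Solar lamp"),
                Map.of("title", "Row without an id"));
        for (SketchProductBean bean : toBeans(rows, "external/products/")) {
            System.out.println(bean.getPath() + " : " + bean.getTitle());
        }
    }
}
```

      The resulting list is what would then be handed to addBeans in the snippet above.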
    • Objective: Control the SIZE of the index yourself
      JCR / Jackrabbit / Hippo Repository has a generic one-fits-all index (or one-fits-none index), which easily grows very large and can hardly be customized.
      However, search is domain specific. Thus, just index what the customer needs.
    • Objective: Don't invent but integrate
      1. Use Solr
      2. Use the Solrj client
      3. Expose the Solrj SolrQuery
    • Objective: Don't invent but integrate
      For example:
        HippoSolrManager solrManager = ...
        String query = ...
        HippoQuery hippoQuery = solrManager.createQuery(query);

        // Paging: page is 1-based
        hippoQuery.setLimit(pageSize);
        hippoQuery.setOffset((page - 1) * pageSize);

        // hippoQuery.getSolrQuery() is the SolrQuery object
        // include scoring
        hippoQuery.getSolrQuery().setIncludeScore(true);

        // highlighting on three fields, with fragments of 200 characters
        hippoQuery.getSolrQuery().setHighlight(true);
        hippoQuery.getSolrQuery().setHighlightFragsize(200);
        hippoQuery.getSolrQuery().addHighlightField("title");
        hippoQuery.getSolrQuery().addHighlightField("summary");
        hippoQuery.getSolrQuery().addHighlightField("htmlContent");

        HippoQueryResult result = hippoQuery.execute(true);
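      The paging arithmetic in the example, (page - 1) * pageSize, assumes a 1-based page number. A small worked sketch of that calculation, plus the matching page count, is given below; PagingSketch and its methods are hypothetical helpers and not part of the HST or Solrj.

```java
// Hypothetical helper for 1-based paging; not part of the HST or Solrj.
class PagingSketch {

    // Index of the first hit on the given 1-based page.
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    // Number of pages needed to show totalHits, rounding up.
    static int pageCount(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize < 1) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize >= 1");
        }
        return (int) ((totalHits + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        System.out.println(offset(3, 10));     // page 3 starts after two full pages of 10
        System.out.println(pageCount(95, 10)); // 9 full pages plus one partial page
    }
}
```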
    • Solr integration to the rescue
      No further comments :-)
    • A very fast demo
      Setup: ~75,000 long Wikipedia docs in the repository
      ... doing the demo ...
    • That was: a very fast demo
    • Wrap up
      I think that with the Solr integration:
      1. Developers will be happier
      2. Customers will be happier
      3. Partners will be happier
      4. Hippo will be happier
      And finally, last and least:
      5. Infra will be happier because the servers stop sweating
    • Questions?
      Check out the example at: http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk