OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content



OpenCms 8.5 integrates Apache Solr. And not only for full text search, but as a powerful query engine as well.

Imagine you want to show a list of "all resources of type news, that have changed since yesterday, where property X has the value Y" on your web page. Sure, there are API methods in OpenCms to load resources based on the type, on the date of change, or on the value of a specific property. But for many common use case combinations, there is no single API call. This means if you create a collector, you often end up sorting out the results of the initial API query in code.

In this session, Rüdiger will show how Apache Solr has been integrated in OpenCms 8.5. He will explain how to create improved front-end full text search functions with advanced options like faceting and spell check suggestions. And he will explain how to use Solr to directly read resources from the OpenCms VFS, allowing query combinations that combine resource attributes, properties and content in a powerful new way.

Published in: Technology

  1. 1. WORKSHOP TRACK: Using Apache Solr to retrieve content. 25.09.2012, Rüdiger Kurz, Alkacon Software
  2. 2. Project Collaboration
  3. 3. Agenda
      1. What is Solr?
      2. Benefits
      3. Searching
      4. Indexing
      5. Configuration
  4. 4. Retrieving data fast
      ● Apache Solr is hopefully not able to answer this question!
      ● BUT it will return the results in less than a second
  5. 5. What is Apache Solr?
      ● Solr is an enterprise search platform from the Apache Lucene project
      ● Solr is highly scalable, providing distributed search and index replication
      ● Solr powers the search and navigation features
      ● Major features include
        ● Powerful full-text search
        ● Hit highlighting
        ● Faceted search
        ● Rich document (e.g., Word, PDF) handling
  6. 6. What is faceted search?
      ● Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely)
      ● Each facet displayed typically shows the number of hits that match that category
      ● Users can then “drill down” by applying specific constraints to the search results
      ● Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search
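
The facet counts and the “drill down” step described above can be illustrated with a minimal plain-Java sketch. It does not use Solr or OpenCms at all; the `Doc` class, the sample titles and the method names are hypothetical and only mimic what a facet on the resource type would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetSketch {

    /** Hypothetical stand-in for an indexed document: a title plus its resource type. */
    static final class Doc {
        final String title;
        final String type;

        Doc(String title, String type) {
            this.title = title;
            this.type = type;
        }
    }

    /** Counts how many documents fall under each value of the "type" facet. */
    static Map<String, Integer> facetCounts(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            counts.merge(doc.type, 1, Integer::sum);
        }
        return counts;
    }

    /** Drill-down: keeps only the documents that match the chosen facet value. */
    static List<Doc> applyConstraint(List<Doc> docs, String type) {
        List<Doc> result = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.type.equals(type)) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Doc> hits = Arrays.asList(
            new Doc("Rose", "v8flower"),
            new Doc("Tulip", "v8flower"),
            new Doc("Welcome", "v8textblock"),
            new Doc("Home", "containerpage"));
        // Facet values with their hit counts, as shown next to each facet entry
        System.out.println(facetCounts(hits));
        // Applying the constraint "v8flower" narrows the result list
        System.out.println(applyConstraint(hits, "v8flower").size());
    }
}
```

In a real setup Solr computes these counts on the index side; the sketch only shows the idea behind the numbers displayed next to each facet value.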
  7. 7. What is Faceted Search? (annotated screenshot)
      ● The breadcrumb trail shows what constraints have already been applied and allows their removal
      ● Regular search results
      ● “Resource types” is a facet, a way of categorizing the results
      ● containerpage, v8flower, v8textblock, … are constraints, or facet values
      ● The facet count shows how many results match each value
      ● The tag bar shows other facet values of the found document that can be applied
  8. 8. Benefits
  9. 9. Database as bottleneck
      ● DBs are proprietary
      ● They require elaborate infrastructures
      ● SQL queries are hard to formulate
      ● SQL on the DB is slower than search queries
      ● A large number of SQL statements turns the DB into a bottleneck
      ● Even low-traffic sites slow down when too many statements are executed on the DB layer
      → Overall performance starts to degrade
  10. 10. Content retrieval so far
      ● OpenCms stores the content in an RDBMS
      ● To access values of an XML content you have to perform the following steps:
        1. Read the resource → Resource (dates, refs, attr)
        2. Read binary content → Content (blob)
        3. Un-marshal content → Marshaled XML
        4. Access with getters → Java Access Bean
  11. 11. The new way of content retrieval
      ● “Read” the whole resource content with a single query
      ● Simpler data structure by storing documents
      ● New flexibility through the power of the Solr query syntax
      ● Best performance based on an optimized index
      ● HTTP interface for external applications
      ● Secure, scalable and cost-effective access
      ● Reduced DB traffic and increased performance
  12. 12. OpenCms 8.5 Solr Integration
  13. 13. Searching
  14. 14. Search with Solr in OpenCms
      ● Querying OpenCms content using the power of Solr’s query syntax
        1. Send an HTTP request to the Solr request handler
        2. Use the new Solr Collector
        3. Call the Java API search method
  15. 15. OpenCms Solr handler
      ● The REST-like interface of Solr lets you access indexed documents over HTTP without any knowledge of CMS-specific syntax
      ● OpenCms performs a permission check, making sure no protected documents are returned
      ● Solr-based UI frameworks like “Ajax Solr” can be used on your website without development costs
      ● Provides an open interface for external applications, e.g. mobile applications
  16. 16. Examples: REST / JAVA / Collector
      1 (HTTP request):
        http://localhost:8080/opencms/opencms/handleSolrSelect?fq=type:v8flower
      2 (Collector):
        <cms:contentload collector="byQuery" param="type:v8flower">
          <cms:contentaccess var="content" />
          ${content.value.Title}
        </cms:contentload>
      3 (Java API):
        CmsObject cms = getCmsObject();
        String query = "fq=type:v8flower";
        CmsSearchManager manager = OpenCms.getSearchManager();
        CmsSolrIndex index = manager.getIndexSolr("Solr Online");
        // Search the online Solr index with the permissions of the given CmsObject
        CmsSolrResultList results = index.search(cms, query);
  17. 17. Live Demo
  18. 18. Indexing
  19. 19. Indexed data
      ● Data indexed by default (hard coded)
      ● Field configuration (opencms-search.xml)
      ● XSD field mapping (Content definition)
      ● Implement a custom field configuration (Java)
  20. 20. Solr schema
      ● The schema file contains all of the details about which fields your documents can contain
      ● OpenCms uses an adjusted version of the schema.xml that is contained within the Apache Solr standard distribution: WEB-INF/solr/conf/schema.xml
      ● If you want to add a new custom field or field type for documents, you can modify this file
  21. 21. Advantages of field types
      ● Types are checked during the index process
      ● They enable easy range queries, even for dates, which makes development considerably easier
      ● Custom types can be added, e.g. key/value tuples or some special JSON fields
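
As a rough illustration of a date range query, the sketch below builds a range clause in standard Solr syntax (`field:[from TO to]`) from two timestamps. The class and method names are hypothetical; `lastmodified` is one of the default index fields from the slides, and the assumption here is that it is a date-typed field that accepts ISO-8601 UTC values.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DateRangeSketch {

    /** Builds an inclusive Solr range clause of the form field:[from TO to]. */
    static String dateRange(String field, Instant from, Instant to) {
        // Solr date fields take ISO-8601 UTC timestamps; whole seconds keep the clause compact
        String lower = from.truncatedTo(ChronoUnit.SECONDS).toString();
        String upper = to.truncatedTo(ChronoUnit.SECONDS).toString();
        return field + ":[" + lower + " TO " + upper + "]";
    }

    public static void main(String[] args) {
        Instant to = Instant.parse("2012-09-25T00:00:00Z");
        Instant from = to.minus(1, ChronoUnit.DAYS);
        // Everything modified during the 24 hours before the given instant
        System.out.println(dateRange("lastmodified", from, to));
    }
}
```

With the sample values this prints `lastmodified:[2012-09-24T00:00:00Z TO 2012-09-25T00:00:00Z]`, which could be passed as an `fq` filter.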
  22. 22. Default indexed data
      ● id - Structure id used as unique identifier for a document (the structure id of the resource)
      ● path - Full root path (the root path of the resource, e.g. /sites/default/flower_en/.content/article.html)
      ● path_hierarchy - The full path as a path-tokenized field (field type: text_path)
      ● parent-folders - Parent folders (multi-valued field containing an entry for each parent path)
      ● type - Type name (the resource type name)
      ● res_locales - Existing locale nodes for XML content and all available locales in case of binary files
      ● created - The creation date (the date when the resource itself was created)
      ● lastmodified - The date last modified (the last modification date of the resource itself)
      ● contentdate - The content date (the date when the resource's content was modified)
      ● released - The release and expiration date of the resource
      ● content - A general content field that holds all extracted resource data (all languages, type text_general)
      ● contentblob - The serialized extraction result, stored to improve the extraction performance while indexing
      ● category - All categories as general text
      ● category_exact - All categories as exact string for faceting reasons
      ● text_<locale> - Extracted textual content optimized for the language-specific search
      ● timestamp - The time when the document was last indexed
      ● *_prop - All properties of a resource as searchable and stored text (<Property_Definition_Name>_prop)
      ● *_exact - All properties of a resource as exact, not stored string (<Property_Definition_Name>_exact)
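
The default fields listed above can be combined into one request. The following sketch builds a URL for the `handleSolrSelect` handler that asks for "all resources of type news, changed since yesterday, where property X has the value Y". The resource type name `news`, the property name `X` and the class name are assumptions for illustration; `NOW-1DAY` is standard Solr date math, and the host and handler path are the ones used in the slides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

public class SolrSelectUrlSketch {

    /** Appends each filter query as a URL-encoded fq parameter to the handler URL. */
    static String buildUrl(String handlerUrl, List<String> filterQueries)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(handlerUrl);
        char separator = '?';
        for (String fq : filterQueries) {
            url.append(separator).append("fq=").append(URLEncoder.encode(fq, "UTF-8"));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        List<String> filters = Arrays.asList(
            "type:news",                      // resource type (hypothetical type name)
            "lastmodified:[NOW-1DAY TO NOW]", // changed within the last day
            "X_prop:Y");                      // property X has the value Y
        System.out.println(buildUrl(
            "http://localhost:8080/opencms/opencms/handleSolrSelect", filters));
    }
}
```

Each `fq` parameter narrows the result set further, so the three conditions are evaluated by Solr in a single query rather than filtered afterwards in collector code.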
  23. 23. XSD field mapping
      ● Additional field mappings for XML contents can now be configured directly within the XSD schema
      ● Without modifying opencms-search.xml
      → No restart of the servlet container required
      <searchsetting element="DisplayDate" searchcontent="false">
        <solrfield targetfield="myDisplayDateField" sourcefield="*_dt" />
      </searchsetting>
      <searchsetting element="Teaser">
        <solrfield targetfield="ateaser">
          <mapping type="item" default="Homepage n.a.">Homepage</mapping>
          <mapping type="property-search">search.special</mapping>
          <mapping type="dynamic" class="my.DynamicMapping">special</mapping>
        </solrfield>
      </searchsetting>
  24. 24. Configuration
  25. 25. Enable Solr in OpenCms
      ● On a fresh installation of OpenCms v8.5, Solr is enabled by default; after updating an existing system to OpenCms 8.5, Solr is disabled
      ● To enable Solr after updating, you must create a Solr home directory in the WEB-INF folder of your OpenCms application
      ● Copy the solr/ folder from the OpenCms standard distribution as a starting point for your configuration
      ● All search configurations are done as usual in the opencms-search.xml below WEB-INF/config
      ● Adding the following lines will enable the embedded server
      <opencms>
        <search>
          <solr enabled="true"/>
          […]
        </search>
      </opencms>
  26. 26. Search index configuration
      ● You can add a custom Solr index with the known OpenCms search configuration syntax
      ● NOTE: class attributes are needed for the index and its field configuration
      <index class="">
        <name>Solr Online</name>
        <rebuild>auto</rebuild>
        <project>Online</project>
        <locale>all</locale>
        <configuration>solr_fields</configuration>
        <sources>
          <source>solr_source</source>
        </sources>
      </index>
  27. 27. Create field configuration (1/3)
      ● To convert an existing Lucene field configuration:
        1. Copy a <fieldconfiguration> node
        2. Change / set the class attribute
        3. Optionally add a type attribute for fields
      <fieldconfiguration class="">
        <name>example</name>
        <description>Converted Lucene Index</description>
        <fields>
          <field name="meta" store="false" index="true" type="en">
            <mapping type="property">Title</mapping>
            <mapping type="property">Description</mapping>
          </field>
        </fields>
      </fieldconfiguration>
  28. 28. Create field configuration (2/3)
      ● As value for the type attribute of a field definition inside the opencms-search.xml you can use the name of any dynamic field defined in the schema.xml
      ● For example:
        i - type="int"
        dt - type="date"
        txt - type="text_general"
        en - type="text_en"
        es - type="text_es"
        fr - type="text_fr"
  29. 29. Create field configuration (3/3)
      ● As said before, the field names are defined in the schema.xml of Solr (<solr_name>); now we define additional fields inside the opencms-search.xml (<opencms_name>)
      ● How does that work?
      String fieldName = <opencms_name>_txt;
      if (existsInSolrSchema(fieldName)) {
        fieldName = <opencms_name>;
      } else if (isTypeAttributeSet()) {
        fieldName = <opencms_name>_<type>;
      }
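
The pseudocode above can be turned into a small runnable sketch. This is not the OpenCms implementation: the class and method names are hypothetical, the set of schema fields stands in for the explicit field names of schema.xml, and it assumes that the schema check is meant to test the plain OpenCms field name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FieldNameSketch {

    /**
     * Resolves the Solr field name for a field configured in opencms-search.xml.
     * schemaFields: explicit field names assumed to exist in schema.xml.
     * type: the optional type attribute of the field, or null if not set.
     */
    static String resolve(String opencmsName, String type, Set<String> schemaFields) {
        if (schemaFields.contains(opencmsName)) {
            // The schema already defines a field with this name: use it unchanged
            return opencmsName;
        }
        if (type != null && !type.isEmpty()) {
            // A type attribute selects the matching dynamic field, e.g. meta_en
            return opencmsName + "_" + type;
        }
        // Default: fall back to the general text dynamic field
        return opencmsName + "_txt";
    }

    public static void main(String[] args) {
        Set<String> schemaFields = new HashSet<>(Arrays.asList("id", "path", "type"));
        System.out.println(resolve("path", null, schemaFields)); // field defined in the schema
        System.out.println(resolve("meta", "en", schemaFields)); // type attribute set
        System.out.println(resolve("meta", null, schemaFields)); // default suffix
    }
}
```

Under these assumptions a field named `meta` with `type="en"` ends up in the dynamic field `meta_en`, while the same field without a type attribute is indexed as `meta_txt`.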
  30. 30. Live Demo
  31. 31. Future steps with IKS and Stanbol
      ● Having Solr and VIE integrated into OpenCms, we are well prepared to start using Apache Stanbol
      ● Stanbol is a top-level Apache project
      ● Stanbol guarantees a quality standard
      ● Stanbol opens the perspective of sustainability
      ● We are looking to integrate Stanbol into OpenCms 9
  32. 32. Live Demo
  33. 33. Integration Conclusion
      ● Permission checked search (secure)
      ● Solr Request handler (accessible)
      ● Solr Collector (integrated)
      ● Result highlighting (user-friendly)
      ● Configuration opportunities (flexible)
      ● Search field mapping (sensitive)
      ● Type based field schema (type-safe)
      ● Lucene conversion (compatible)
  34. 34. Thank you very much for your attention! Rüdiger Kurz, Alkacon Software GmbH
  35. 35. Any Questions?