A presentation by Gaston Gonzalez at adaptTo() 2014 describing several approaches for integrating Apache Solr with AEM. It starts with an introduction to various pull and push indexing strategies (e.g., Sling Eventing, content publishing and web crawling). The topic of content ingestion is followed by an approach for delivering rapid search front-end experiences using AEM Solr Search.
4. Choosing a Data Ingestion Approach
Where is your content stored?
AEM/CQ
External databases and data stores
External feeds (RSS)
How many documents?
Small – Tens of Thousands
Medium – Hundreds of Thousands
Large – Millions+
5. Choosing a Data Ingestion Approach (2/3)
Index latency
Near real-time indexing (low latency)
Hourly, daily, weekly (high latency)
Do you need document enrichment?
Add additional metadata
Sanitize and transform data
Merge/join documents/records
6. Choosing a Data Ingestion Approach (3/3)
Implementation Complexity
Implementation effort
Time to market
Supporting infrastructure
Crawlers
Document Processing Platform
Messaging Queues
10. Solr Update Request Handler Pattern (3/3)
# Request from CQ a dump of the content in the Solr JSON update handler format
curl -s -u ${CQ_USER}:${CQ_PASS} -o ${SAVE_FILE} \
  "http://localhost:4502/apps/geometrixx-media/solr/updatehandler?type=${SLING_RESOURCE_TYPE}"

# Perform delete by query
curl "http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true" \
  -H "Content-Type: application/json" --data-binary '{"delete": { "query": "*:*" }}'

# Post the local JSON file that was saved in step 1 to Solr and commit
curl "http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true" \
  -H "Content-Type: application/json" --data-binary @${SAVE_FILE}
11. HTML Selector + Crawler Pattern
Method: Pull
Content Location: CQ, CQ + External
Document Corpus: Small, Medium
Index Latency: Medium, High
Document Enrichment: Simple
Implementation Complexity: Low+
Infrastructure: Crawler
Recipe
1. Implement a selector to generate a simplified HTML response (sketched after this recipe).
2. Implement a dynamic index page or selector for crawler use.
3. Configure the crawler (Nutch) seed URL to use the dynamic index pages and crawl to depth = 1.
4. Schedule crawler.
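A minimal sketch of step 1, assuming the Felix SCR @SlingServlet annotation; the resource type, property names, and emitted markup are illustrative placeholders, not from the talk:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ValueMap;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

// Renders <page>.crawlerpage.html: title and body copy only, no boilerplate
// (header, footer, navigation), so the crawler indexes clean content.
@SlingServlet(resourceTypes = "geometrixx-media/components/page",
              selectors = "crawlerpage", extensions = "html", methods = "GET")
public class CrawlerPageServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request,
                         SlingHttpServletResponse response)
            throws ServletException, IOException {
        Resource content = request.getResource().getChild("jcr:content");
        ValueMap props = (content != null ? content : request.getResource()).getValueMap();
        String title = props.get("jcr:title", "");
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        // Production code should HTML-escape the property values.
        out.printf("<html><head><title>%s</title></head><body>", title);
        out.printf("<h1>%s</h1>", title);
        out.printf("<p>%s</p>", props.get("jcr:description", ""));
        out.print("</body></html>");
    }
}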
12. HTML Selector + Crawler Pattern (2/2)
/content/geometrixx-media/en/events/big-sur-beach-party.crawlerpage.html
<html>
<head>
<title>Big Sur Beach Party</title>
</head>
<body>
<h1>Big Sur Beach Party</h1>
<p>Some content here</p>
</body>
</html>
/content/geometrixx-media/en/events.crawlerindex.html (seed URL)
…
<ul>
<li><a href="…">Big Sur Beach Party</a></li>
<li><a href="…">The only Festival You Need</a></li>
…
</ul>
…
$ nutch crawl ${NUTCH_HOME}/urls -dir ${NUTCH_HOME}/crawl -threads 4 -depth 1 -topN 100 \
  -solr http://${HOST}:${PORT}/solr/${CORE}
13. Event Listener Pattern
Method: Push
Content Location: CQ, CQ + External*
Document Corpus: Small, Medium, Large
Index Latency: Low
Document Enrichment: Simple
Implementation Complexity: Low
Infrastructure: None
Recipe
1. Wrap SolrJ as an OSGi bundle.
2. Implement a listener (JCR observer, Sling eventing, or workflow*) to listen for changes to the desired content.
3. On event, trigger the appropriate SolrJ index call (add/update/delete); see the sketch below.
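A minimal sketch of steps 2 and 3, assuming Sling eventing and SolrJ 4.x; the Solr URL, core name, and id field are placeholder assumptions:

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.SlingConstants;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

// Listens for resource changes under /content and mirrors them to Solr.
@Component(immediate = true)
@Service(EventHandler.class)
@Property(name = EventConstants.EVENT_TOPIC, value = {
        SlingConstants.TOPIC_RESOURCE_ADDED,
        SlingConstants.TOPIC_RESOURCE_CHANGED,
        SlingConstants.TOPIC_RESOURCE_REMOVED })
public class SolrIndexEventHandler implements EventHandler {

    // Placeholder endpoint; wrap SolrJ in its own OSGi bundle per step 1.
    private final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    @Override
    public void handleEvent(Event event) {
        String path = (String) event.getProperty(SlingConstants.PROPERTY_PATH);
        if (path == null || !path.startsWith("/content/")) {
            return; // only index site content
        }
        try {
            if (SlingConstants.TOPIC_RESOURCE_REMOVED.equals(event.getTopic())) {
                solr.deleteById(path);            // delete
            } else {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", path);         // add/update (upsert by id)
                solr.add(doc);
            }
            solr.commit(); // or rely on autoCommit for large volumes
        } catch (Exception e) {
            // log; with plain eventing a failed update is lost (see messaging pattern)
        }
    }
}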
Solr: a Java-based, open-source enterprise search platform built on Apache Lucene
Framing the integration
Solr as a content aggregator (RDBMSs, product information systems, external document repositories)
Solr/AEM to deliver to public-facing, dynamic, search-driven sites
More than search – dynamic modules / components
Guiding principle: Push all data into Solr
Patterns viewed through several dimensions: ingestion method, content location, etc.
Overview of Solr’s Update Request Handler
Cons
Performance issues if a JCR query is used to find all content
Additional Steps
Update the script to perform a poor man’s backup of your Solr JSON (e.g., ${SAVE_FILE}.<timestamp>.json)
Implement selector to generate simplified HTML response
http://localhost:4502/content/geometrixx-media/en.crawlerpage.html
Exclude boilerplate content (e.g., headers, footers, navigation elements)
Some degree of document enrichment is possible here
Implement a dynamic index page for crawler use.
http://localhost:4502/content/geometrixx-media/en.crawlerindex.html
Simple list of HTML anchor tags that link to the corresponding pages (i.e., crawlerpage)
Moving away from this pattern
Tips:
Use the author instance to trigger both author and publish messages if multiple publishers are used.
For publish, listen for replication events (see the sketch below).
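A sketch of that tip, assuming CQ's replication event API (ReplicationAction.EVENT_TOPIC); the actual SolrJ calls are elided as comments:

import com.day.cq.replication.ReplicationAction;
import com.day.cq.replication.ReplicationActionType;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

// Fires on activate/deactivate, i.e., when content reaches (or leaves) publish.
@Component(immediate = true)
@Service(EventHandler.class)
@Property(name = EventConstants.EVENT_TOPIC, value = ReplicationAction.EVENT_TOPIC)
public class ReplicationIndexHandler implements EventHandler {

    @Override
    public void handleEvent(Event event) {
        ReplicationAction action = ReplicationAction.fromEvent(event);
        if (action == null) {
            return;
        }
        if (action.getType() == ReplicationActionType.ACTIVATE) {
            // add/update action.getPath() in the publish index via SolrJ
        } else if (action.getType() == ReplicationActionType.DEACTIVATE
                || action.getType() == ReplicationActionType.DELETE) {
            // delete action.getPath() from the publish index
        }
    }
}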
References:
Event Handling in CQ by Nicolas Peltier (http://experiencedelivers.adobe.com/cemblog/en/experiencedelivers/2012/04/event_handling_incq.html)
Messaging Middleware
RabbitMQ, ActiveMQ, AWS SQS (Amazon)
Message format
Depends on the implementation. Some implementations, such as SQS, restrict message size. In that scenario, include the following in your message (see the sketch after this list):
Index Operation type (e.g., add, update, delete)
Document UID
Pointer to CQ page using a selector. The selector, “search.xml,” should present all the data for ingestion.
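A minimal sketch of such a message over JMS/ActiveMQ; the queue name, broker URL, and JSON field names are assumptions (RabbitMQ or SQS would carry the same payload):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class IndexMessagePublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection conn = factory.createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(session.createQueue("solr-index"));

        // Small payload: operation, UID, and a pointer to the full data rendition.
        String payload = "{"
                + "\"operation\": \"add\","
                + "\"uid\": \"/content/geometrixx-media/en/events/big-sur-beach-party\","
                + "\"url\": \"http://publish:4503/content/geometrixx-media/en/events/big-sur-beach-party.search.xml\""
                + "}";
        producer.send(session.createTextMessage(payload));
        conn.close();
    }
}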
Pros
Content updates not lost if Solr is unavailable
Allows other systems to consume AEM content events
Document Processing Pipelines
Hydra (http://findwise.github.io/Hydra/)
Custom
Solr is not mature in this area. See http://wiki.apache.org/solr/DocumentProcessing.
Avoid the UpdateRequestProcessor chain unless you have simple needs (see the sketch below)
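For context, a sketch of the kind of "simple need" the chain handles well: a hypothetical processor factory that stamps an ingest date on each document (the field name is illustrative):

import java.io.IOException;
import java.util.Date;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Registered in solrconfig.xml as part of an updateRequestProcessorChain.
public class IngestDateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                // Trivial enrichment; anything heavier belongs in an external pipeline.
                cmd.getSolrInputDocument().setField("ingest_date", new Date());
                super.processAdd(cmd);
            }
        };
    }
}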
Pros
Supports separate indexes for author and publish
Messaging queue allows for Solr index outages during CQ content updates
Document processing allows for advanced document enrichment prior to indexing
An external SolrCloud deployment supports just about all scaling needs (sharding and replicas)