SlideShare a Scribd company logo
1 of 26
APACHE SLING & FRIENDS TECH MEETUP 
BERLIN, 22-24 SEPTEMBER 2014 
Integrating Open Source Search with AEM 
Gaston Gonzalez
Agenda 
 Apache Solr 
 Data Ingestion 
 Drivers 
 Implementation Patterns 
 Searching Data 
adaptTo() 2014 2
Data Ingestion: Drivers 
adaptTo() 2014 3
Choosing a Data Ingestion Approach 
 Where is your content stored? 
 AEM/CQ 
 External databases and data stores 
 External feeds (RSS) 
 How many documents? 
 Small – Tens of Thousands 
 Medium – Hundreds of Thousands 
 Large – Millions+ 
adaptTo() 2014 4
Choosing a Data Ingestion Approach (2/3) 
 Index latency 
 Near real-time indexing (low latency) 
 Hourly, daily, weekly (high latency) 
 Do you need document enrichment? 
 Add additional metadata 
 Sanitize and transform data 
 Merge/join documents/records 
adaptTo() 2014 5
Choosing a Data Ingestion Approach (3/3) 
 Implementation Complexity 
 Implementation effort 
 Time to market 
 Supporting infrastructure 
 Crawlers 
 Document Processing Platform 
 Messaging Queues 
adaptTo() 2014 6
Data Ingestion: Implementation Patterns 
adaptTo() 2014 7
Solr Update Request Handler Pattern 
Method: Pull Document Enrichment: Simple 
Content Location: CQ, CQ + External Implementation Complexity: Low 
Document Corpus: Small Infrastructure: None 
Index Latency: High 
Recipe 
1. Implement servlet to generate same output as a Solr’s update handler (i.e., JSON). 
2. Create script to invoke servlet (curl) and save response to file system. 
3. Update index by posting (curl) documents to Solr request handler (i.e., 
JsonUpdateRequestHandler). 
4. Call script from scheduler (cron). 
adaptTo() 2014 8
Solr Update Request Handler Pattern (2/3) 
adaptTo() 2014 9
Solr Update Request Handler Pattern (3/3) 
# Request from CQ a dump of the content in the Solr JSON update handler format 
curl -s -u ${CQ_USER}:${CQ_PASS} -o ${SAVE_FILE} 
http://localhost:4502/apps/geometrixx-media/ 
solr/updatehandler?type=${SLING_RESOURCE_TYPE} 
# Perform delete by query 
curl http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true -H 
"Content-Type: application/json" --data-binary '{"delete": { "query":"*:*" }}' 
# Post the local JSON file that was saved in step 1 to Solr and commit 
curl http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true -H 
"Content-Type: application/json" --data-binary @${SAVE_FILE} 
adaptTo() 2014 10
HTML Selector + Crawler Pattern 
Method: Pull Document Enrichment: Simple 
Content Location: CQ, CQ + External Implementation Complexity: Low+ 
Document Corpus: Small, Medium Infrastructure: Crawler 
Index Latency: Medium, High 
Recipe 
1. Implement selector to generate simplified HTML response. 
2. Implement dynamic index page or selector for crawler use. 
3. Configure crawler (Nutch) seed URL to use the dynamic index pages and crawl to 
depth = 1. 
4. Schedule crawler. 
adaptTo() 2014 11
HTML Selector + Crawler Pattern (2/2) 
/content/geometrixx-media/ 
en/events/big-sur-beach-party. 
crawlerpage.html 
<html> 
<head> 
<title>Big Sur Beach Party</title> 
</head> 
<body> 
<h1>Big Sur Beach Party</h1> 
<p>Some content here</p> 
</body> 
</html> 
/content/geometrixx-media/ 
en/events.crawlerindex.html 
(seed URL) 
… 
<ul> 
<li><a href=“…”>Big Sur Beach Party</a></li> 
<li><a href=“…”>The only Festival You Need</a></li> 
… 
</ul> 
… 
$ nutch crawl ${NUTCH_HOME}/urls -dir ${NUTCH_HOME}/crawl -threads 4 –depth 1 -topN 100-solr 
http://${HOST}:${PORT}/solr/${CORE} 
adaptTo() 2014 12
Event Listener Pattern 
Method: Push Document Enrichment: Simple 
Content Location: CQ, CQ + External* Implementation Complexity: Low 
Document Corpus: Small, Medium, Large Infrastructure: None 
Index Latency: Low 
Recipe 
1. Wrap SolrJ as an OSGi bundle. 
2. Implement listener (JCR Observer, Sling eventing or workflow*) to listen for changes 
to desired content. 
3. On event, trigger appropriate SolrJ index call (add/update/delete). 
adaptTo() 2014 13
Event Listener Pattern (2/2) 
adaptTo() 2014 14
Event Listener + Messaging Pattern 
Method: Push Document Enrichment: Simple 
Content Location: CQ, CQ + External* Implementation Complexity: Med+High 
Document Corpus: Medium, Large Infrastructure: Messaging Middleware 
Index Latency: Low 
Recipe 
1. Wrap messaging middleware client as an OSGi bundle. 
2. Implement listener (JCR Observer, Sling eventing or workflow) to listen for changes 
to desired content. 
3. On event, write to message* to queue. 
4. External process polls queue and performs index update. 
* If messaging platform restricts message size, consider passing URL pointer to an ingestion friendly page. 
adaptTo() 2014 15
Event Listener + Messaging Pattern (2/2) 
Message Format 
{ 
"operation":”<ADD|UPDATE|DELETE>", 
"timeStamp":"<UTC timestamp>", 
"path":"/content/geometrixx-media/en/events/big-sur-beach-party.search.xml", 
"uid”:"<uid>", 
"docUrl":http://<host>/content/geometrixx-media/en/events/big-sur-beach-party.search.json" 
} 
Document Manifest URL 
{ 
"id" : ”<uid>", 
"title" : ”Big Sur Beach Party”, 
… 
} 
adaptTo() 2014 16
Event Listener + Messaging + DP Pattern 
Method: Push Document Enrichment: Complex 
Content Location: CQ, CQ + External Implementation Complexity: High 
Document Corpus: Medium, Large Infrastructure: Messaging Middleware, 
Index Latency: Low Document Processing 
Recipe 
1. Same as Event Listener + Messaging Pattern (steps 1-3) 
2. Read message from document processing platform. 
3. Perform document transformation and enrichment. 
4. Update index from document processing platform using SolrJ. 
adaptTo() 2014 17
Enterprise Deployment Architecture 
adaptTo() 2014 18
Searching Data 
adaptTo() 2014 19
AEM Solr Search 
 Open source Apache Solr / AEM integration 
 Allows content authors and developers to build rapid search interfaces 
 Provides rich set of search UI components 
 Search results, facets, search input, statistics, pagination, etc. 
 Server-side and client-side implementation 
 Implementations based on Bootstrap and AJAX Solr 
 Geometrixx Media Sample (JSON update handler, eventing, …) 
adaptTo() 2014 20
AEM Solr Search (2/3) 
Clone it from Git. 
$ git clone https://github.com/headwirecom/aem-solr-search.git 
$ cd aem-solr-search 
Deploy AEM Solr Search to AEM 
$ mvn clean install -Pauto-deploy-all 
Deploy GeometrixxMedia Sample 
$ mvn install -Pauto-deploy-sample 
Start local Solr instance 
$ cd aemsolrsearch-quickstart 
$ mvn clean resources:resources jetty:run 
Index GeometrixxMedia Articles (launch new terminal) 
$ cd ../aemsolrsearch-geometrixx-media-sample 
$ ./index-geometrixx-media-articles.sh 
adaptTo() 2014 21
AEM Solr Search (3/3) 
adaptTo() 2014 22
AEM Solr Search (3/3) 
Demo 
http://www.aemsolrsearch.com/#/demo 
Or 
https://www.youtube.com/watch?v=TUhMbD4DUSU 
adaptTo() 2014 23
Appendix 
adaptTo() 2014 24
Resources 
Sites 
 http://www.aemsolrsearch.com/ 
 http://www.gastongonzalez.com/ 
Code & Samples 
 https://github.com/headwirecom/aem-solr-search 
adaptTo() 2014 25
Contact Us 
Support: info@headwire.com 
Email: gg@headwire.com 
Twitter: therealgaston 
adaptTo() 2014 26

More Related Content

What's hot

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 

What's hot (20)

Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Ahsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetAhsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - Datasheet
 
Elasticsearch speed is key
Elasticsearch speed is keyElasticsearch speed is key
Elasticsearch speed is key
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for You
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 

Viewers also liked

Viewers also liked (20)

Séminaire BASTIA 2017
Séminaire BASTIA 2017Séminaire BASTIA 2017
Séminaire BASTIA 2017
 
Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?
 
Frans Mäyrä: Tosi ja virtuaalinen - liikettä maailmojen välillä
Frans Mäyrä: Tosi ja virtuaalinen - liikettä maailmojen välilläFrans Mäyrä: Tosi ja virtuaalinen - liikettä maailmojen välillä
Frans Mäyrä: Tosi ja virtuaalinen - liikettä maailmojen välillä
 
Active Directory Delegation - By @rebootuser
Active Directory Delegation - By @rebootuserActive Directory Delegation - By @rebootuser
Active Directory Delegation - By @rebootuser
 
That Conference Date and Time
That Conference Date and TimeThat Conference Date and Time
That Conference Date and Time
 
Date and Time MomentJS Edition
Date and Time MomentJS EditionDate and Time MomentJS Edition
Date and Time MomentJS Edition
 
公開用EU一般データ保護規則への実務対応
公開用EU一般データ保護規則への実務対応 公開用EU一般データ保護規則への実務対応
公開用EU一般データ保護規則への実務対応
 
Demystifying Oak Search
Demystifying Oak SearchDemystifying Oak Search
Demystifying Oak Search
 
Elastic search adaptto2014
Elastic search adaptto2014Elastic search adaptto2014
Elastic search adaptto2014
 
Metro Design
Metro DesignMetro Design
Metro Design
 
Implementing Site Search in CQ5 / AEM
Implementing Site Search in CQ5 / AEMImplementing Site Search in CQ5 / AEM
Implementing Site Search in CQ5 / AEM
 
Creating Real-Time Data Mashups with Node.JS and Adobe CQ
Creating Real-Time Data Mashups with Node.JS and Adobe CQCreating Real-Time Data Mashups with Node.JS and Adobe CQ
Creating Real-Time Data Mashups with Node.JS and Adobe CQ
 
The Business Place Botswana
The Business Place BotswanaThe Business Place Botswana
The Business Place Botswana
 
SAGES 2017
SAGES 2017SAGES 2017
SAGES 2017
 
Algunas enfermedades psiquiátricas
Algunas enfermedades psiquiátricasAlgunas enfermedades psiquiátricas
Algunas enfermedades psiquiátricas
 
4 q16 presentation final
4 q16 presentation   final4 q16 presentation   final
4 q16 presentation final
 
Measuring cultural health and well-being in the workplace.
Measuring cultural health and well-being in the workplace.Measuring cultural health and well-being in the workplace.
Measuring cultural health and well-being in the workplace.
 
Brand spotter analytics_price_aug_2013
Brand spotter analytics_price_aug_2013Brand spotter analytics_price_aug_2013
Brand spotter analytics_price_aug_2013
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 

Similar to adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

PHP Continuous Data Processing
PHP Continuous Data ProcessingPHP Continuous Data Processing
PHP Continuous Data Processing
Michael Peacock
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
DataWorks Summit
 
EAS Data Flow lessons learnt
EAS Data Flow lessons learntEAS Data Flow lessons learnt
EAS Data Flow lessons learnt
euc-dm-test
 
Consuming RESTful services in PHP
Consuming RESTful services in PHPConsuming RESTful services in PHP
Consuming RESTful services in PHP
Zoran Jeremic
 

Similar to adaptTo() 2014 - Integrating Open Source Search with CQ/AEM (20)

Interoperability Fundamentals: SWORD 2
Interoperability Fundamentals: SWORD 2Interoperability Fundamentals: SWORD 2
Interoperability Fundamentals: SWORD 2
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
 
PHP Continuous Data Processing
PHP Continuous Data ProcessingPHP Continuous Data Processing
PHP Continuous Data Processing
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Netflix conductor
Netflix conductorNetflix conductor
Netflix conductor
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Eclipse Training - Main eclipse ecosystem classes
Eclipse Training - Main eclipse ecosystem classesEclipse Training - Main eclipse ecosystem classes
Eclipse Training - Main eclipse ecosystem classes
 
EAS Data Flow lessons learnt
EAS Data Flow lessons learntEAS Data Flow lessons learnt
EAS Data Flow lessons learnt
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Sword Crig 2007 12 06
Sword Crig 2007 12 06Sword Crig 2007 12 06
Sword Crig 2007 12 06
 
Overview of RESTful web services
Overview of RESTful web servicesOverview of RESTful web services
Overview of RESTful web services
 
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
 
Lightweight Deposit using SWORD
Lightweight Deposit using SWORDLightweight Deposit using SWORD
Lightweight Deposit using SWORD
 
Consuming RESTful Web services in PHP
Consuming RESTful Web services in PHPConsuming RESTful Web services in PHP
Consuming RESTful Web services in PHP
 
Consuming RESTful services in PHP
Consuming RESTful services in PHPConsuming RESTful services in PHP
Consuming RESTful services in PHP
 
Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015
 
Devfest 2023 - Service Weaver Introduction - Taipei.pdf
Devfest 2023 - Service Weaver Introduction - Taipei.pdfDevfest 2023 - Service Weaver Introduction - Taipei.pdf
Devfest 2023 - Service Weaver Introduction - Taipei.pdf
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

  • 1. APACHE SLING & FRIENDS TECH MEETUP BERLIN, 22-24 SEPTEMBER 2014 Integrating Open Source Search with AEM Gaston Gonzalez
  • 2. Agenda  Apache Solr  Data Ingestion  Drivers  Implementation Patterns  Searching Data adaptTo() 2014 2
  • 3. Data Ingestion: Drivers adaptTo() 2014 3
  • 4. Choosing a Data Ingestion Approach  Where is your content stored?  AEM/CQ  External databases and data stores  External feeds (RSS)  How many documents?  Small – Tens of Thousands  Medium – Hundreds of Thousands  Large – Millions+ adaptTo() 2014 4
  • 5. Choosing a Data Ingestion Approach (2/3)  Index latency  Near real-time indexing (low latency)  Hourly, daily, weekly (high latency)  Do you need document enrichment?  Add additional metadata  Sanitize and transform data  Merge/join documents/records adaptTo() 2014 5
  • 6. Choosing a Data Ingestion Approach (3/3)  Implementation Complexity  Implementation effort  Time to market  Supporting infrastructure  Crawlers  Document Processing Platform  Messaging Queues adaptTo() 2014 6
  • 7. Data Ingestion: Implementation Patterns adaptTo() 2014 7
  • 8. Solr Update Request Handler Pattern Method: Pull Document Enrichment: Simple Content Location: CQ, CQ + External Implementation Complexity: Low Document Corpus: Small Infrastructure: None Index Latency: High Recipe 1. Implement servlet to generate same output as a Solr’s update handler (i.e., JSON). 2. Create script to invoke servlet (curl) and save response to file system. 3. Update index by posting (curl) documents to Solr request handler (i.e., JsonUpdateRequestHandler). 4. Call script from scheduler (cron). adaptTo() 2014 8
  • 9. Solr Update Request Handler Pattern (2/3) adaptTo() 2014 9
  • 10. Solr Update Request Handler Pattern (3/3) # Request from CQ a dump of the content in the Solr JSON update handler format curl -s -u ${CQ_USER}:${CQ_PASS} -o ${SAVE_FILE} http://localhost:4502/apps/geometrixx-media/ solr/updatehandler?type=${SLING_RESOURCE_TYPE} # Perform delete by query curl http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true -H "Content-Type: application/json" --data-binary '{"delete": { "query":"*:*" }}' # Post the local JSON file that was saved in step 1 to Solr and commit curl http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true -H "Content-Type: application/json" --data-binary @${SAVE_FILE} adaptTo() 2014 10
  • 11. HTML Selector + Crawler Pattern Method: Pull Document Enrichment: Simple Content Location: CQ, CQ + External Implementation Complexity: Low+ Document Corpus: Small, Medium Infrastructure: Crawler Index Latency: Medium, High Recipe 1. Implement selector to generate simplified HTML response. 2. Implement dynamic index page or selector for crawler use. 3. Configure crawler (Nutch) seed URL to use the dynamic index pages and crawl to depth = 1. 4. Schedule crawler. adaptTo() 2014 11
  • 12. HTML Selector + Crawler Pattern (2/2) /content/geometrixx-media/ en/events/big-sur-beach-party. crawlerpage.html <html> <head> <title>Big Sur Beach Party</title> </head> <body> <h1>Big Sur Beach Party</h1> <p>Some content here</p> </body> </html> /content/geometrixx-media/ en/events.crawlerindex.html (seed URL) … <ul> <li><a href=“…”>Big Sur Beach Party</a></li> <li><a href=“…”>The only Festival You Need</a></li> … </ul> … $ nutch crawl ${NUTCH_HOME}/urls -dir ${NUTCH_HOME}/crawl -threads 4 –depth 1 -topN 100-solr http://${HOST}:${PORT}/solr/${CORE} adaptTo() 2014 12
  • 13. Event Listener Pattern Method: Push Document Enrichment: Simple Content Location: CQ, CQ + External* Implementation Complexity: Low Document Corpus: Small, Medium, Large Infrastructure: None Index Latency: Low Recipe 1. Wrap SolrJ as an OSGi bundle. 2. Implement listener (JCR Observer, Sling eventing or workflow*) to listen for changes to desired content. 3. On event, trigger appropriate SolrJ index call (add/update/delete). adaptTo() 2014 13
  • 14. Event Listener Pattern (2/2) adaptTo() 2014 14
  • 15. Event Listener + Messaging Pattern Method: Push Document Enrichment: Simple Content Location: CQ, CQ + External* Implementation Complexity: Med+High Document Corpus: Medium, Large Infrastructure: Messaging Middleware Index Latency: Low Recipe 1. Wrap messaging middleware client as an OSGi bundle. 2. Implement listener (JCR Observer, Sling eventing or workflow) to listen for changes to desired content. 3. On event, write to message* to queue. 4. External process polls queue and performs index update. * If messaging platform restricts message size, consider passing URL pointer to an ingestion friendly page. adaptTo() 2014 15
  • 16. Event Listener + Messaging Pattern (2/2) Message Format { "operation":”<ADD|UPDATE|DELETE>", "timeStamp":"<UTC timestamp>", "path":"/content/geometrixx-media/en/events/big-sur-beach-party.search.xml", "uid”:"<uid>", "docUrl":http://<host>/content/geometrixx-media/en/events/big-sur-beach-party.search.json" } Document Manifest URL { "id" : ”<uid>", "title" : ”Big Sur Beach Party”, … } adaptTo() 2014 16
  • 17. Event Listener + Messaging + DP Pattern Method: Push Document Enrichment: Complex Content Location: CQ, CQ + External Implementation Complexity: High Document Corpus: Medium, Large Infrastructure: Messaging Middleware, Index Latency: Low Document Processing Recipe 1. Same as Event Listener + Messaging Pattern (steps 1-3) 2. Read message from document processing platform. 3. Perform document transformation and enrichment. 4. Update index from document processing platform using SolrJ. adaptTo() 2014 17
  • 20. AEM Solr Search  Open source Apache Solr / AEM integration  Allows content authors and developers to build rapid search interfaces  Provides rich set of search UI components  Search results, facets, search input, statistics, pagination, etc.  Server-side and client-side implementation  Implementations based on Bootstrap and AJAX Solr  Geometrixx Media Sample (JSON update handler, eventing, …) adaptTo() 2014 20
  • 21. AEM Solr Search (2/3) Clone it from Git. $ git clone https://github.com/headwirecom/aem-solr-search.git $ cd aem-solr-search Deploy AEM Solr Search to AEM $ mvn clean install -Pauto-deploy-all Deploy GeometrixxMedia Sample $ mvn install -Pauto-deploy-sample Start local Solr instance $ cd aemsolrsearch-quickstart $ mvn clean resources:resources jetty:run Index GeometrixxMedia Articles (launch new terminal) $ cd ../aemsolrsearch-geometrixx-media-sample $ ./index-geometrixx-media-articles.sh adaptTo() 2014 21
  • 22. AEM Solr Search (3/3) adaptTo() 2014 22
  • 23. AEM Solr Search (3/3) Demo http://www.aemsolrsearch.com/#/demo Or https://www.youtube.com/watch?v=TUhMbD4DUSU adaptTo() 2014 23
  • 25. Resources Sites  http://www.aemsolrsearch.com/  http://www.gastongonzalez.com/ Code & Samples  https://github.com/headwirecom/aem-solr-search adaptTo() 2014 25
  • 26. Contact Us Support: info@headwire.com Email: gg@headwire.com Twitter: therealgaston adaptTo() 2014 26

Editor's Notes

  1. Java-based, open-source enterprise search platform based on Apache Lucene Framing the integration Solr as a content aggregator (RDBMs, product information systems, external document repositories) Solr/AEM to deliver to public-facing, dynamic, search-driven sites More than search – dynamic modules / components
  2. Guiding principle: Push all data into Solr
  3. Pattern viewed though several dimensions: ingestion method, content location, etc. Overview of Solr’s Update Request Handler Cons Performance issues if a JCR query is used to find all content
  4. http://localhost:8080/solr/#/collection1/documents
  5. Additional Steps Update the scrip to perform a poor man’s backup of your Solr JSON (i.e., ${SAVE_FILE}.<timestamp>.json)
  6. Implement selector to generate simplified HTML response http://localhost:4502/content/geometrixx-media/en.crawlerpage.html Exclude boilerplate content (e.g., headers, footers navigation elements, etc.) Some degree of document enrichment is possible here Implement a dynamic index page for crawler use. http://localhost:4502/content/geometrixx-media/en.crawlerindex.html Simple list of html anchor tags that link to to corresponding page (i.e., crawlerpage) Moving away from this pattern
  7. Tips: Use author instance for triggering both author and publish messages if multiple publishers are used. For publish listen for replication events. References: Event Handling in CQ by Nicolas Peltier (http://experiencedelivers.adobe.com/cemblog/en/experiencedelivers/2012/04/event_handling_incq.html)
  8. Messaging Middleware RabbitMQ, ActiveMQ, AWS SQS (Amazon) Message format Depends on implementation. Some implementation such as SQS restrict the size of messages. In this scenario, include the following in your message: Index Operation type (e.g., add, update, delete) Document UID Pointer to CQ page using a selector. The selector, “search.xml,” should present all the data for ingestion. Pros Content updates not lost if Solr is unavailable
  9. Also for other systems to consume AEM content events
  10. References Document Processing Pipelines Hydra (http://findwise.github.io/Hydra/) Custom Solr is not mature in this area. See http://wiki.apache.org/solr/DocumentProcessing. Avoid UpdateRequestProcessor chain unless you have simple needs
  11. Pros Supports separate indexes for author and publish Messaging queue allows for Solr index outages during CQ content updates Document processing allows for advanced document enrichment prior to indexing External SolrCloud deployment support just about all scaling needs (sharding and replicas)
  12. Also Includes: HTTP proxy, SolrJ bundle