Effective Searching by Dominik Kornas
Usage Rights

© All Rights Reserved

Presentation Transcript

  • Effective searching: Integrating External Search Engines with Adobe AEM. Dominik Kornaś
  • Who am I? 3 years at Cognifide, exactly today. Senior software engineer & technical lead. Focused on systems integration tasks. The “search guy” at Cognifide.
  • What we won’t talk about: sorting, document structure, indexing, managed relevancy model, input data processing, highlighter, faceted search, wildcard search, statistics, autocomplete, spellchecking, lemmatization, sentence search, pagination, content normalization, metadata, data collections & views.
  • The goal of searching
  • The goal of searching: “What is the best British football team?” If we ask such a question, will the search engine find the answer?
  • The goal of searching: “What is the best British football team?” The search engine will find the question, not the answer.
  • The goal of searching: “What is the best British football team?” vs. “best team football UK”. Are we asking questions or issuing queries?
  • The goal of searching: effective searching is about finding keywords
      • in the shortest possible time
      • close to each other in a block of text
      • that are in the desired context
    and about being sure the engine knows about the data we are looking for!
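
The "close to each other in a block of text" criterion can be illustrated with a toy proximity measure: the length of the shortest run of tokens that contains every keyword. This is only a sketch. The function name `smallest_window` and the naive word tokenization are assumptions made for illustration, and real search engines score proximity in more elaborate ways.

```python
import re


def smallest_window(text, keywords):
    """Return the length (in tokens) of the shortest token span that
    contains every keyword at least once, or None if one is missing."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return None
    counts = {}   # keyword -> occurrences inside the current window
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in wanted:
            counts[token] = counts.get(token, 0) + 1
        # Shrink from the left while the window still covers all keywords.
        while len(counts) == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            head = tokens[left]
            if head in counts:
                counts[head] -= 1
                if counts[head] == 0:
                    del counts[head]
            left += 1
    return best
```

A smaller result means the keywords sit closer together, which is one signal that a block of text is about the query rather than merely mentioning its words.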
  • Effective searching: Indexing
  • The Past
  • Microsoft FAST: The first major external search integration with AEM (then: CQ 5.4) at Cognifide. Push-like indexing using the CQ-FAST connector from Adobe.
  • Microsoft FAST: Implemented as a dedicated replication agent, triggered by content replication. http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html
  • Microsoft FAST: Replication agent processing workflow: HTTP request for the content. [Diagram: Content builder, Transport handler, MS FAST; Markup, Metadata]
  • Microsoft FAST: We can decide which instance the content should be read from.
  • Microsoft FAST: Replication agent processing workflow: metadata.ecma evaluation.
  • Microsoft FAST: Replication agent processing workflow: data upload.
  • Microsoft FAST: Sends content to MS FAST. The “cq5” suffix in the URI is a document collection: a named subset of documents in the entire FAST index. http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html
  • Microsoft FAST: Replication agent processing workflow: indexing.
  • Microsoft FAST: The replication agent works well for one site stored in a single FAST document collection. It becomes complicated in a multi-site environment, where each site must be located in a separate index area and search results must not contain data coming from other sites.
  • Microsoft FAST
  • Microsoft FAST: A complex ACL configuration was used to ensure that only the one proper agent delivers a document to FAST. It was hard to set up and maintain without proper tools that automated the whole process.
  • The Present Day
  • Google Search Appliance: For the AEM & GSA integration, we considered reusing the CQ-FAST connector approach. Aware of its issues, however, we decided to develop our own micro-framework that takes care of the indexing process. It is installed as a single OSGi bundle and provides a set of services and utilities to help with the indexing.
  • Google Search Appliance: [Diagram: Content replication, Filtering, Push to Publish, Indexing queue(s), Content gathering, Metadata processing, Push to external engine; spanning Author and Publish, with process status tracking & persistence] The indexing process spans the author and the publish AEM instances. All stages are tracked, so it is possible to recover from a failure and retry the indexing.
  • Google Search Appliance: The process starts with content replication, or programmatically from the backend, e.g. triggered by the scheduler service.
  • Google Search Appliance: Each replicated content path is filtered against a whitelist & a blacklist. There is an option to use a custom OSGi service that decides whether the content should be indexed, removed or ignored.
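
The filtering stage could look roughly like the sketch below. It is not the framework's code: the deck does not say how the lists are expressed or which one wins, so the regular-expression patterns, the "blacklist beats whitelist" order, the `Decision` names and the content paths are all assumptions. The optional `custom_rule` callable stands in for the custom OSGi service.

```python
import re
from enum import Enum


class Decision(Enum):
    INDEX = "index"
    REMOVE = "remove"
    IGNORE = "ignore"


def decide(path, whitelist, blacklist, custom_rule=None):
    """Decide what to do with a replicated content path.

    custom_rule (hypothetical) may return a Decision, or None to fall
    back to the whitelist/blacklist check.
    """
    if custom_rule is not None:
        verdict = custom_rule(path)
        if verdict is not None:
            return verdict
    # Assumption: a blacklist match always wins over a whitelist match.
    if any(re.fullmatch(pattern, path) for pattern in blacklist):
        return Decision.IGNORE
    if any(re.fullmatch(pattern, path) for pattern in whitelist):
        return Decision.INDEX
    return Decision.IGNORE
```

For example, with a whitelist of `/content/site/.*` and a blacklist of `.*/archive/.*`, a page under `/content/site/en/news` would be indexed while anything under an `archive` folder would be skipped.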
  • Google Search Appliance: The indexing information is persisted in a special kind of repository node and replicated to the publish instance. We can choose which publish instance(s) will receive the data.
  • Google Search Appliance: The information is received and instantly dispatched to the indexing queue(s). We can handle indexing in a single search engine or in several different ones.
  • Google Search Appliance: The content is gathered using the SlingRequestProcessor OSGi service. It works like a request for an HTML page that is both sent and consumed by the Java code itself.
  • Google Search Appliance: Metadata is collected according to multiple different rules:
      • the content resource type
      • the content path
      • values of the component properties
      • custom rules
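
One way to picture rule-based metadata collection is a list of (condition, extractor) pairs applied to each page. The sketch below is an illustration only: the page dictionary layout, the resource type `site/components/newspage`, the paths and the field names are all hypothetical, and the actual framework's rule format is not described in the deck.

```python
def collect_metadata(page, rules):
    """Apply every matching rule and merge the metadata it produces."""
    metadata = {}
    for matches, extract in rules:
        if matches(page):
            metadata.update(extract(page))
    return metadata


RULES = [
    # Rule keyed on the content resource type.
    (lambda p: p["resourceType"] == "site/components/newspage",
     lambda p: {"type": "news"}),
    # Rule keyed on the content path.
    (lambda p: p["path"].startswith("/content/site/en/"),
     lambda p: {"language": "en"}),
    # Rule that copies the value of a component property.
    (lambda p: "author" in p["properties"],
     lambda p: {"author": p["properties"]["author"]}),
]
```

Custom rules fit the same shape: any pair of callables can be appended to the list.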
  • Google Search Appliance: The content and metadata are combined and sent to the search engine. Depending on the implementation, this can be done for each single document or in batches.
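
The combine-and-send step can be sketched as two small helpers: one that merges markup and metadata into a single record, and one that groups records into batches. The record layout (`id`, `content`) and both function names are assumptions for illustration; the actual push to the engine is left out.

```python
def build_document(path, content, metadata):
    """Combine gathered markup and collected metadata into one record."""
    document = {"id": path, "content": content}
    document.update(metadata)
    return document


def in_batches(documents, size):
    """Yield consecutive slices of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(documents), size):
        yield documents[start:start + size]
```

Sending one document at a time is then just the special case of a batch size of 1.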
  • Google Search Appliance: Failure or timeout, then retry. In case of any failure, indexing is rescheduled and launched again, up to the configured number of times. If the server goes down, indexing restarts when the machine is up again.
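
The retry behaviour can be sketched as a bounded loop around the push call. This is a simplified, in-memory illustration: `index_with_retry`, `push` and `on_status` are hypothetical names, and the `on_status` callback only marks the places where a real implementation would persist the process status so that indexing can resume after a restart.

```python
def index_with_retry(push, document, max_attempts, on_status=lambda status: None):
    """Call push(document) until it succeeds or max_attempts is reached.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        on_status(("attempt", attempt))
        try:
            push(document)
        except Exception as error:  # any failure or timeout triggers a retry
            last_error = error
            on_status(("failed", attempt))
            continue
        on_status(("done", attempt))
        return attempt
    raise RuntimeError(
        f"indexing failed after {max_attempts} attempts"
    ) from last_error
```

A production version would typically also wait between attempts; the delay is omitted here to keep the sketch short.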
  • Google Search Appliance: The flexible nature of our solution saved us when some fancy requirements came up.
  • The Future
  • Apache Solr: The search engine, which is:
      • free & open source
      • powerful
      • customizable
      • scalable
    Most importantly, it is a part of Jackrabbit Oak (the successor to Jackrabbit 2), the repository engine used for AEM 6. AEM with integrated Solr is right there.
  • Apache Solr: The solution developed for GSA has been ported to work with Solr. Changes:
      • Replaced the “glue code” that does the final data push with code that uses the SolrJ Java library.
      • Names of the document metadata fields have been changed to follow the Solr naming convention for dynamic fields.
    Everything else remained untouched.
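
The renaming for dynamic fields can be illustrated by appending a type suffix to each metadata name. The suffixes below (`_s`, `_i`, `_f`, `_b`, `_dt`) follow the conventions of Solr's stock example schema; which suffixes the actual project used depends on its schema and is not stated in the deck. The sketch is in Python for brevity and does not use SolrJ, and the function names are hypothetical.

```python
from datetime import datetime

# Ordered list: bool must be checked before int, because bool is a
# subclass of int in Python.
SUFFIXES = [
    (bool, "_b"),
    (int, "_i"),
    (float, "_f"),
    (datetime, "_dt"),
    (str, "_s"),
]


def solr_field_name(name, value):
    """Return the metadata name with a dynamic-field suffix for its type."""
    for python_type, suffix in SUFFIXES:
        if isinstance(value, python_type):
            return name + suffix
    raise TypeError(f"no dynamic field suffix for {type(value).__name__}")


def to_solr_document(doc_id, metadata):
    """Build a document dict whose metadata keys carry type suffixes."""
    document = {"id": doc_id}
    for name, value in metadata.items():
        document[solr_field_name(name, value)] = value
    return document
```

With such a convention, new metadata fields can be indexed without editing the schema each time, as long as a matching dynamic field pattern is defined.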
  • Search driven components
  • Search driven components: No server-side processing. Search engine used as a mini database of metadata. Configuration via query parameters. Pure front-end implementation.
  • Search driven components: The whole page can be read from the dispatcher cache. An AJAX request gets the content directly from the search engine. The response is structured as JSON, which is easy to parse and display using JavaScript, for example: { "id": "223344", "firstName": "Michael", "lastName": "Johnson", "phone": "(123)-777-8888", "office": "Office UK", "department": "504", "title": "Lead Architect" }
  • Search driven components: Search results component configured to return employee data.
  • Search driven components: User profile. The name, mobile, email, image path etc. are all metadata values of the document.
  • Search driven components: Carousel with news. By changing the maximum number of search results, we can control the number of slides in the carousel.
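
Configuration via query parameters can be illustrated by building the request URL a component would call. The sketch assumes a Solr-style select handler with its standard `q`, `rows`, `fl` and `wt` parameters; the deck does not say which engine backs these components, and the host name, the field names and the `build_query_url` helper are hypothetical. In the deck the request is issued from JavaScript in the browser, so this Python version only shows how the URL is assembled.

```python
from urllib.parse import urlencode


def build_query_url(base_url, query, max_results, fields=None):
    """Build a search URL; max_results caps the number of returned items."""
    params = {"q": query, "rows": max_results, "wt": "json"}
    if fields:
        params["fl"] = ",".join(fields)
    return base_url + "?" + urlencode(params)
```

For the news carousel, changing `max_results` from 5 to 3 would be the only change needed to show three slides instead of five.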
  • Thank you!