Super Size Your Search

As organisations store more and more information in their Alfresco content hubs, search and discovery of content become increasingly important. Alfresco comes bundled with Apache Lucene and Apache Solr for search. Although these provide full-text capabilities, they lack the scalability and functionality of newer cloud-scale search software such as Apache Solr Cloud 4, Elastic Search and Amazon Cloud Search. Searching across multiple Alfresco instances, including Alfresco Cloud, is also a challenge, and none of the possible approaches is good enough to be production ready.

This talk shows you how to index and search content stored in one or more Alfresco repositories, other CMIS repositories or file systems using Apache Solr Cloud 4, Elastic Search or Amazon Cloud Search, while still ensuring the confidentiality of the documents based on the permissions configured in Alfresco or in the other repositories.

Published in: Technology

Super Size Your Search

  1. #SummitNow Super Size Your Search. 6th November 2013. Piergiorgio Lucidi (Sourcesense), Fran Alvarez (Zaizi)
  2. Piergiorgio Lucidi • Open Source ECM Specialist at Sourcesense • Alfresco Certified Trainer / Engineer • Alfresco Wiki Gardener / Community Star • Alfresco forum supporter • Global Moderator of the Italian forum • Author and Technical Reviewer at Packt • PMC Member and Mentor at ASF • Project Leader in the JBoss Community
  3. Overview: How to build and manage your search server: 1. Scenario 2. Introducing Apache ManifoldCF 3. Zaizi Integrated Search Solution
  4. Scenario: An overview of the typical complex search architecture
  5. Scenario – Alfresco limitations: Alfresco supports these search engines: • Apache Lucene (embedded) • Apache Solr (provided by Alfresco) • Development is needed if other repositories must be involved. Every other approach must be implemented (ScheduledActions, WebScripts, etc.)
  6. Scenario – Embedded: Simple Search Architecture. Alfresco is the only repository involved in the architecture, using the embedded search engine: • the repository must take care of the indexes, including managing index transactions. [Diagram: FrontEnd applications, Alfresco, Apache Lucene, Indexes]
  7. Scenario – Embedded – Cluster: Not easy to scale out with embedded Lucene: 1. every cluster node must have its own search indexes 2. the cluster must synchronize the indexes. [Diagram: two Alfresco nodes, each with Apache Lucene and its own Indexes, linked by JGroups]
  8. Scenario – Simple Architecture: Simple search architecture. Alfresco is the only repository involved in the architecture, with an external search server: 1. the search server can be used to publish content in the front-end architecture 2. the repository stays in the logical back end. [Diagram: FrontEnd applications, Search Engine, Indexes, Alfresco]
  9. Scenario – Publish with search: A search engine can be used for: • advanced management of search indexes • scaling out • executing complex searches on content • publishing content in the front-end (FE) architecture
  10. Scenario – Publish with search: Publish with search architecture. Alfresco is the only repository involved in the architecture, with an external search server: 1. the search server can be used to publish content in the front-end architecture (HTML) 2. the repository stays in the logical back end. [Diagram: FrontEnd (FrontEnd applications, Search Engine, Indexes); BackEnd (Alfresco, Lucene / Solr, Indexes)]
  11. Scenario – Simple Architecture: Simple search architecture. Alfresco is the only repository involved in the architecture, with an external search server: 1. the search server can be used to publish content in the front-end architecture 2. the repository stays in the logical back end. [Diagram: FrontEnd applications, Search Engine, Indexes, Alfresco]
  12. Scenario – Complex Architecture: 1. Alfresco is only one of the platforms that must be involved in your search architecture 2. You don't want to increase the development effort 3. You just want something to configure
  13. Scenario – Complex Architecture: Architecture with different ECM systems. Alfresco is one of the content platforms that must be involved in the indexing process. [Diagram: Alfresco, SharePoint, FileNet, CMIS, JIRA, Google Drive, DropBox; Search Engine; Indexes]
  16. Introducing Apache ManifoldCF
  17. Apache ManifoldCF – History: The ManifoldCF code base was granted by MetaCarta to the Apache Software Foundation in December 2009. The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments. The project graduated as an Apache Top Level Project in July 2012.
  18. Apache ManifoldCF – What is it? An open source crawler: • crawling model (add, change, delete) • schedules jobs to create indexes • gets content from repositories • pushes content to search servers
  19. Apache ManifoldCF – What is it? [Diagram: Repository 1 to Repository 4, Apache ManifoldCF, Search Server 1 to Search Server 4]
  20. Apache ManifoldCF – What is it? Out of the box it is distributed as a webapp: • REST API • Authority Service • ACL indexes • Crawler UI. It can also be embedded in any Java application
  21. Apache ManifoldCF – Why? • Reliability • Incremental • Flexible • Multi repositories • Security model • Monitoring
  22. ManifoldCF – Why? – Reliability: Job scheduling and configuration are stored in the database to maintain the state of all executions. [Diagram: Repository 1 to Repository 4; Apache ManifoldCF with Pull Agent Daemon and Database; Search Server 1 to Search Server 4]
  23. ManifoldCF – Why? – Incremental: Content changesets are obtained from the repository API. [Diagram: Repository 1; Apache ManifoldCF with Pull Agent Daemon and Database; query, Complete Changesets]
  24. ManifoldCF – Why? – Flexible: If the repository can't supply all the changes, ManifoldCF can discover them through crawling. [Diagram: Apache ManifoldCF with Pull Agent Daemon and Database; query, Incomplete Changesets, Change Discovery]
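
The change discovery described on slides 23 and 24 can be illustrated with a minimal sketch. It assumes that the crawler remembers one version string per document from the previous run and compares it with what a fresh crawl reports; the function name `discover_changes` and the sample data are hypothetical and are not taken from ManifoldCF itself.

```python
from typing import Dict, List, Tuple


def discover_changes(
    indexed: Dict[str, str], crawled: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare stored version strings with freshly crawled ones.

    Both arguments map a document id to a version string (for example a
    last-modified timestamp). Returns sorted (added, changed, deleted) ids,
    mirroring the add / change / delete crawling model.
    """
    added = sorted(doc for doc in crawled if doc not in indexed)
    changed = sorted(
        doc for doc in crawled if doc in indexed and crawled[doc] != indexed[doc]
    )
    deleted = sorted(doc for doc in indexed if doc not in crawled)
    return added, changed, deleted


# State remembered from the previous run (hypothetical sample data).
previous = {"a.pdf": "v1", "b.doc": "v1", "c.txt": "v3"}
# What a full crawl of the repository reports now.
current = {"a.pdf": "v1", "b.doc": "v2", "d.odt": "v1"}

added, changed, deleted = discover_changes(previous, current)
print(added, changed, deleted)  # ['d.odt'] ['b.doc'] ['c.txt']
```

When a repository can report complete changesets, the comparison step is unnecessary because the repository API already returns the three lists; the sketch covers the fallback case only.
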
  25. ManifoldCF – Why? – Multi repo: Jobs can retrieve content from the following repositories: • Google Drive • Dropbox • HDFS • CMIS-compliant • Alfresco • IBM FileNet • EMC Documentum • Microsoft SharePoint • OpenText LiveLink • Autonomy Meridio • Memex Patriarch • Windows Share/DFS • Generic JDBC • Generic Filesystem • Generic RSS and Web
  26. ManifoldCF – Why? – Multi repo: Jobs can ingest content into the following search servers: • Apache Solr • ElasticSearch • OpenSearchServer • MetaCarta GTS
  27. ManifoldCF – Why? – Security: Retrieve per-content ACLs. [Diagram: Repository 1 to Repository 4, Apache ManifoldCF, Search Server 1 to Search Server 4; Authority Service with Authority 1 and Authority 2; access tokens]
  28. ManifoldCF – Why? – Security: Retrieve per-content ACLs. [Diagram: Repository 1 to Repository 4, Apache ManifoldCF, Search Server 1 to Search Server 4; Authority Service with Authority 1 and Authority 2; user, access tokens, user-specific search results]
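
How access tokens lead to user-specific search results (slides 27 and 28) can be sketched as follows. This is an illustration under stated assumptions, not the actual ManifoldCF or Solr implementation: the class `IndexedDocument`, the field names `allow_tokens` and `deny_tokens`, and the token values are all hypothetical. The idea shown is that ACL tokens are stored with each indexed document and compared with the tokens an authority returns for the searching user, with deny taking precedence over allow.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class IndexedDocument:
    """A search hit carrying the ACL tokens that were indexed with it."""
    doc_id: str
    allow_tokens: FrozenSet[str]
    deny_tokens: FrozenSet[str] = frozenset()


def is_visible(doc: IndexedDocument, user_tokens: FrozenSet[str]) -> bool:
    # A matching deny token always hides the document.
    if doc.deny_tokens & user_tokens:
        return False
    # Otherwise at least one allow token must match.
    return bool(doc.allow_tokens & user_tokens)


def filter_results(
    results: Iterable[IndexedDocument], user_tokens: FrozenSet[str]
) -> List[str]:
    """Return the ids of the hits the user is allowed to see, in order."""
    return [doc.doc_id for doc in results if is_visible(doc, user_tokens)]


results = [
    IndexedDocument("report", frozenset({"GROUP_finance"})),
    IndexedDocument("memo", frozenset({"GROUP_EVERYONE"}), frozenset({"alice"})),
    IndexedDocument("plan", frozenset({"alice"})),
]
# Tokens an authority might return for each user (hypothetical values).
alice = frozenset({"alice", "GROUP_EVERYONE"})
bob = frozenset({"bob", "GROUP_EVERYONE", "GROUP_finance"})

print(filter_results(results, alice))  # ['plan']
print(filter_results(results, bob))    # ['report', 'memo']
```

In a real deployment the filtering would normally run inside the search server as part of the query rather than as a post-processing step, so that facets and paging stay consistent with what the user may see.
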
  29. ManifoldCF – Why? – Monitoring: The Crawler UI allows you to: • configure jobs and connectors • monitor job execution • monitor content ingestion • status reports (document status, queue status) • history reports (simple history, maximum activity, maximum bandwidth, result histogram)
  30. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs]
  31. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector]
  32. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector; Output Connector]
  33. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector; Output Connector; Authority Connector]
  34. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector (query to retrieve contents); Output Connector; Authority Connector]
  35. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector (query to retrieve contents); Output Connector (metadata mapping, content ingestion); Authority Connector]
  36. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector (query to retrieve contents); Output Connector (metadata mapping, content ingestion); Authority Connector (retrieve content ACEs)]
  37. ManifoldCF – Architecture. [Diagram: Repository, Job, Search Server, ACLs; Repository Connector (query to retrieve contents); Output Connector (metadata mapping, content ingestion); Authority Connector (retrieve content ACEs)] • verbal description • crawling model • scheduling
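
The roles shown in the architecture slides can be sketched as a small pipeline in which a job pulls documents through a repository connector, applies a metadata mapping and pushes the result through an output connector. ManifoldCF itself is a Java framework; the Python interfaces below (`RepositoryConnector`, `OutputConnector`, `run_job`) and the in-memory classes are hypothetical stand-ins meant only to show how the pieces relate. The authority connector, which resolves a user to access tokens at query time, is left out of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol


@dataclass
class RepositoryDocument:
    doc_id: str
    content: str
    metadata: Dict[str, str]
    acl: List[str]  # access tokens read from the repository


class RepositoryConnector(Protocol):
    def fetch(self) -> Iterable[RepositoryDocument]: ...


class OutputConnector(Protocol):
    def ingest(self, doc_id: str, fields: Dict[str, object]) -> None: ...


class InMemoryRepository:
    """Stand-in for a real repository such as Alfresco or a file system."""

    def __init__(self, docs: Iterable[RepositoryDocument]) -> None:
        self._docs = list(docs)

    def fetch(self) -> Iterable[RepositoryDocument]:
        return iter(self._docs)


class InMemorySearchServer:
    """Stand-in for a real search server; keeps documents in a dict."""

    def __init__(self) -> None:
        self.index: Dict[str, Dict[str, object]] = {}

    def ingest(self, doc_id: str, fields: Dict[str, object]) -> None:
        self.index[doc_id] = fields


def run_job(
    repository: RepositoryConnector,
    output: OutputConnector,
    field_mapping: Dict[str, str],
) -> int:
    """Crawl the repository once and return the number of ingested documents."""
    count = 0
    for doc in repository.fetch():
        # Metadata mapping: rename repository fields to search-server fields.
        fields: Dict[str, object] = {
            field_mapping.get(name, name): value
            for name, value in doc.metadata.items()
        }
        fields["content"] = doc.content
        fields["allow_tokens"] = list(doc.acl)  # ACLs travel with the document
        output.ingest(doc.doc_id, fields)
        count += 1
    return count


repo = InMemoryRepository(
    [RepositoryDocument("1", "Budget 2013", {"cm:title": "Budget"}, ["GROUP_finance"])]
)
server = InMemorySearchServer()
print(run_job(repo, server, {"cm:title": "title"}))  # 1
print(server.index["1"]["title"])  # Budget
```

Because the job only depends on the two interfaces, swapping the repository or the search server means supplying a different connector, which matches the "just something to configure" goal stated on slide 12.
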
  38. Who is using ManifoldCF?
  39. ManifoldCF – Resources: The project is available at http://manifoldcf.apache.org/ From this website you can access the mailing lists, documentation and download links for binaries and source.
  40. ManifoldCF – Resources – Book: ManifoldCF in Action by Karl Wright, published by Manning. Karl is the original developer and the principal committer of Apache ManifoldCF. The book is available at http://www.manning.com/wright
  41. Zaizi Integrated Search Solution
  42. Fran Alvarez • Director of Zaizi Iberia and Lead Architect • Alfresco Certified Engineer • Responsible for large Alfresco architectures • Semantic Consultant for Sensefy • Alfresco Meetups Organizer
  43. Alfresco + Solr Approach: Quite a good architecture • Performance issues are solved • Different architectures depending on business requirements. However… • It does not cover some use cases or scenarios • It does not leverage Cloud benefits or the latest technologies • With huge data volumes there are other approaches. How can we solve the limitations and enhance the benefits?
  44. Alfresco + Solr Approach: • Decouples the search solution from Alfresco • Allows different search solutions to be implemented • Allows the search solution to be changed without changing anything in Alfresco • Not even a property! • Provides an API to integrate it with Alfresco as the search engine • Even with other repository vendors! E.g. Filesystem, Sharepoint, Documentum, Filenet, Drupal… • And preserves security permissions in the results • Alfresco permissions are indexed and used during search. It's included in our Semantic solution: Sensefy!
  45. What we've done in ManifoldCF: Repository Connector: • Alfresco Repository Connector: New implementation • Removing the dependency on the Alfresco Solr API. Output Connectors: • Cloud Search Output Connector: Design & Development • Elastic Search Output Connector: Improvements • Solr Cloud Output Connector: Configuration for Alfresco. Authority Connector: • Alfresco Authority Connector: Design & Development • Similar approach to Alfresco Solr • ACL reads for Users and Groups in Alfresco
  46. Scenarios: Let's see some examples
  47. I: Several Alfresco instances. Current Approach: • Each Alfresco has its own Search subsystem • They can't share indexes. Implications: • Federated search is not an option • Results can't be merged • If they were, which result set should come first? Conclusion: Results could be presented to users in different tabs or merged "manually". Not the best approach
  48. I: Several Alfresco instances. Zaizi Approach: • Our solution works like a search box • which manages a single index. Implications: • All documents are sent to the same index • Users can select results from either all Alfresco instances or a subset. Conclusion: Search across repositories. Could be based on Elastic Search, Solr Cloud, Amazon Cloud, etc.
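
The single-index approach of slide 48 can be sketched with a toy index in which every document records the instance it came from, so one query can cover all instances or a chosen subset. The class `SharedIndex`, the instance names and the substring matching are assumptions made for illustration; a real deployment would delegate matching and ranking to the search engine.

```python
from typing import Dict, List, Optional, Set


class SharedIndex:
    """Toy single index shared by several repositories."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, str]] = []

    def add(self, source: str, doc_id: str, text: str) -> None:
        # Each entry remembers which repository it came from.
        self._docs.append({"source": source, "doc_id": doc_id, "text": text})

    def search(self, term: str, sources: Optional[Set[str]] = None) -> List[str]:
        """Return 'source:doc_id' for matching documents.

        When `sources` is None every repository is searched; otherwise only
        documents from the named repositories are considered.
        """
        hits = []
        for doc in self._docs:
            if sources is not None and doc["source"] not in sources:
                continue
            # Case-insensitive substring match stands in for real scoring.
            if term.lower() in doc["text"].lower():
                hits.append(f'{doc["source"]}:{doc["doc_id"]}')
        return hits


index = SharedIndex()
index.add("alfresco-1", "42", "Quarterly budget report")
index.add("alfresco-2", "7", "Budget forecast")
index.add("filesystem", "notes.txt", "Meeting notes")

print(index.search("budget"))                  # ['alfresco-1:42', 'alfresco-2:7']
print(index.search("budget", {"alfresco-2"}))  # ['alfresco-2:7']
```

Since all hits come from one index, there is no need to decide which instance's result set should come first, which is the problem raised on slide 47.
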
  49. II: Alfresco + Other data providers. Current Approach: • Alfresco has its own Search subsystem • Other repositories may or may not have their own Search subsystem. Implications: • Different data providers mean different formats • E.g. Filesystem does not support CMIS • Alfresco can't reach external data. Conclusion: No way to merge results and present them uniformly to end users
  50. II: Alfresco + Other data providers. Zaizi Approach: • Both Alfresco and other repositories share the Search subsystem (ManifoldCF). Implications: • Results from Alfresco and other providers will have the same format in our solution • They will speak 'our' language • Alfresco reaches external data when communicating with our solution. Conclusion: Results are present and accessible across data providers
  51. III: Alfresco + O(TB) data. Current Approach: • Alfresco has its own Search subsystem • All data is in one Solr instance (or several, in a cluster). Implications: • Every Solr node manages the whole index • No chance to apply scaling techniques for indexing: • Sharding, Replication… Conclusion: Huge servers are required and performance might be compromised
  52. III: Alfresco + O(TB) data. Zaizi Approach: • Alfresco uses our solution • Data is indexed in the search solution that suits it best: • Amazon Cloud, Solr Cloud, Elastic Search… Implications: • The cloud search solution manages the index • Indexing techniques can be applied according to use cases • Sharding, Replication. Conclusion: A search strategy can be adopted and easily implemented with the search solution that fits best
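
Sharding, mentioned on slides 51 and 52, means that each node holds only part of the index instead of the whole of it. A minimal sketch of hash-based routing follows; it is a generic illustration and is not a description of how Solr Cloud, Elastic Search or Amazon Cloud Search route documents internally. The function names and the document ids are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number in a stable, repeatable way."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # A cryptographic digest is used only because it is stable across runs,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribution(doc_ids: Iterable[str], num_shards: int) -> Counter:
    """Count how many documents land on each shard."""
    return Counter(shard_for(doc_id, num_shards) for doc_id in doc_ids)


doc_ids = [f"workspace://SpacesStore/doc-{i}" for i in range(1000)]
counts = distribution(doc_ids, 4)
print(sorted(counts.items()))
```

The same document id always maps to the same shard, so updates and deletes reach the node that holds the document, while replication (not shown) would copy each shard to additional nodes for availability.
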
  53. Apache ManifoldCF: Other benefits. Can extract, index and map information from any other sources • Apache Stanbol, RedLink, any other data enricher • Our solution gathers everything in one place • Documents, entities… Permissions are checked just once • Everything is in the same place, even user authorization capabilities • Performance and scalability are improved • Faceted search and other search capabilities are combined with this permission feature
  54. Demo
  55. Conclusions: The Zaizi solution allows searching and indexing in the most popular Cloud Search solutions • Other Search solutions can be integrated as well. The Zaizi solution allows retrieving information from the most popular repositories • Other data providers can be integrated too • It solves plenty of current issues related to search and indexing in Alfresco • Can be used outside Alfresco, or with Alfresco and any other data repository. The Zaizi solution manages permissions and security from the most popular repositories and the latest Cloud search technologies. Fully supported by us!
  56. Conclusions
  57. What's coming: Powerful User Interface • Admin functions • Wide range of facets • UI for Share. Benchmarking. New connectors • Filesystem authority • RedLink repository • Stanbol repository. Alfresco Search Subsystem?
  58. #SummitNow
