Apache ManifoldCF @ Linux Day 2012


An overview of Apache ManifoldCF, with an introduction to repositories and search servers. Includes an overview of the latest improvements and new features.

Published in: Technology

  1. Apache ManifoldCF
  2. About me
     ● Open Source ECM Specialist at Sourcesense
     ● Author and Technical Reviewer at Packt Publishing
        ○ Alfresco 3 Web Services (2010)
        ○ GateIn Cookbook (2012)
     ● Alfresco Community (nickname OpenPj)
        ○ Alfresco Wiki Gardener
        ○ Top 10 supporter (English and Italian)
        ○ Moderator of the Italian forum
     ● PMC Member at the Apache Software Foundation
     ● JBoss Community
        ○ Content editor for jboss.org
        ○ Project Leader and Committer for PortletSwap
  3. Overview
     ● The story
     ● What is ManifoldCF?
        ○ What is a repository?
        ○ What is a search server?
     ● Why ManifoldCF?
     ● Architecture
     ● The growing path
        ○ The 0.3-incubating version
        ○ The 0.4-incubating version
        ○ The 0.5-incubating version
        ○ The 0.6 version (graduated ^__^)
        ○ What's new in the 1.0.1 version
     ● The book: ManifoldCF in Action
     ● Demo
     ● Resources
  4. The story
     The original ManifoldCF code base was granted by MetaCarta Inc. to the Apache Software Foundation in December 2009. The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments. The project graduated as an Apache Top Level Project in July 2012. ^__^
  5. What is ManifoldCF?
     ● Open Source crawler
        ○ schedule jobs to create indexes
           ■ get contents from repositories
           ■ push contents to search servers
     [Diagram: Repositories 1-3 -> Apache ManifoldCF -> Search Servers 1-3]
  6. What is ManifoldCF?
     ● Open Source crawler
        ○ schedule jobs to create indexes
           ■ get contents from repositories
           ■ push contents to search servers
     ● Out of the box, it is distributed as J2EE web apps
        ○ REST API
        ○ Authority Service
        ○ Crawler UI
     ● Can be embedded in any Java application
  7. What is a repository?
     ● Open Source crawler
        ○ schedule jobs to create indexes
           ■ get contents from repositories
           ■ push contents to search servers
     [Diagram: Repositories 1-3 -> Apache ManifoldCF -> Search Servers 1-3]
  8. What is a repository?
     ● a central place to put and get contents
     ● contents are kept in an organized way
        ○ ER model is the old way
        ○ Node graph
           ■ properties (metadata)
           ■ associations
           ■ renditions
     ● base component of Enterprise Content Management (ECM) systems
     ● the word comes from the Latin repositorium
        ○ table of service
        ○ vessel
        ○ chamber
        ○ a place to keep and find your things!
  9. Enterprise Content Management
     "Enterprise content management (ECM) is a formalized means of organizing and storing an organization's documents, and other content, that relate to the organization's processes. The term encompasses strategies, methods, and tools used throughout the lifecycle of the content."
     Wikipedia: http://en.wikipedia.org/wiki/Enterprise_content_management
  10. Enterprise Content Management
     [Diagram: ECM components: Enterprise services; Workflows and processes (BPM); Users + groups (LDAP, IDM); Repository]
  11. What is a repository? - You use it!
     ● Some simple examples:
        ○ SMTP servers
        ○ Google Drive
        ○ Dropbox
     ● Some Open Source repository implementations:
        ○ eXo JCR
        ○ Apache Jackrabbit
     ● Some Open Source ECM systems for critical usage:
        ○ Alfresco
        ○ Nuxeo
        ○ Hippo
  12. What is a repository? - Decoration
     [Diagram: a Repository with Indexes.
      Access channels: CMIS, JCR, REST, SOAP, IMAP, EMAIL, FTP.
      Operations: apply metadata; retrieve content using metadata.
      Query languages: CMIS, JCR, SQL, XPath, Lucene, Full Text (Google style).]
  13. What is a repository? - Architecture
     [Diagram: APIs (CMIS, REST, FTP, WebDAV, IMAP); Model; Storage (Content Store, Indexes)]
  14. What is a repository? - Repo Model
     ● a different point of view on how to manage data
        ○ no more relational databases (ER)
     ● repositories offer you an API!
     ● based on the JCR Repository Model (JSR-283)
        ○ workspaces
        ○ identifiers
        ○ users
        ○ nodes and node types (contents)
           ■ properties and property types
           ■ associations (shared nodes)
  15. What is a repository? - Repo Model
     ● A node is a generic piece of content stored in a repository
        ○ type
        ○ properties
        ○ associations
        ○ binary streams (optional)
           ■ renditions
           ■ text document
           ■ video
           ■ image
           ■ ...
  16. What is a repository? - Repo Model
     [Diagram: a Node with a Type, Properties (metadata: name, description, mimetype, tags, categories) and Renditions (Binary 1, Binary 2, Binary 3)]
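The node model from the two slides above can be sketched as a small data structure. This is only an illustration: the `Node` class, its field names and the `find_by_property` helper are hypothetical and are not the JCR, CMIS or ManifoldCF API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a repository node."""
    node_type: str                                               # the node's type
    properties: Dict[str, Any] = field(default_factory=dict)     # metadata
    associations: List["Node"] = field(default_factory=list)     # links to other nodes
    renditions: Dict[str, bytes] = field(default_factory=dict)   # optional binary streams


def find_by_property(nodes: List[Node], key: str, value: Any) -> List[Node]:
    """Return the nodes whose metadata property `key` equals `value`."""
    return [node for node in nodes if node.properties.get(key) == value]


# A document node with metadata and one rendition, referenced by a folder node.
report = Node(
    "document",
    {"name": "report.pdf", "mimetype": "application/pdf", "tags": ["finance"]},
    renditions={"thumbnail": b"\x89PNG"},
)
folder = Node("folder", {"name": "Reports"}, associations=[report])
```

Because type, metadata and binary streams live on the node itself, content can be looked up by its properties rather than through table joins, which is the contrast with the ER model drawn on the earlier slide.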
  17. What is a repository? - Repo Model
     [Diagram: a Repository containing Workspace 1, Workspace 2 and Workspace 3; a Root node with child nodes A, B, C, D, E, G]
  18. Why use a repository?
     ● adding a new node type only means adding configuration
     ● you can scale out easily
     ● storing very large amounts of data
     ● storing simple data structures, such as simple JSON documents
     ● looking up data by keys rather than using queries
     ● searching for data based upon relevance rather than criteria
     ● evolving schemas and/or data structures
     ● caching data in-memory for performance
     ● giving up consistency guarantees for increased availability
  19. Why use a repository?
     ● Standard API
        ○ Content Management Interoperability Services (CMIS)
        ○ Java Content Repository (JCR)
     ● Hierarchical structure
     ● Transaction support
     ● Versioning
     ● Locking
     ● Observation
     ● References
     ● Navigation services
        ○ parents
        ○ children
        ○ associated
     ● Search services
  20. What is a search server?
     ● Open Source crawler
        ○ schedule jobs to create indexes
           ■ get contents from repositories
           ■ push contents to search servers
     [Diagram: Repositories 1-3 -> Apache ManifoldCF -> Search Servers 1-3]
  21. What is a search server?
     A search server is an application that allows users to find repository contents quickly using:
     ● keywords (full text search)
     ● content fields
     ● tags
     ● categories
     ● ranking
     The information it keeps is stored as indexes.
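To make the idea of "indexes" concrete, here is a minimal sketch of a keyword (full text) index, assuming a plain inverted index with whitespace tokenization. The function names are hypothetical; real search servers such as Apache Solr add analysis, fields and ranking on top of this basic structure.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lower-cased word to the ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a": "Apache ManifoldCF crawler",
    "b": "Apache Solr search server",
}
idx = build_index(docs)
```

Looking a word up in the index avoids scanning every document at query time, which is why the lookup stays fast as the repository grows.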
  22. What is a search server?
     [Diagram: REST API; Storage; Indexes]
  23. Why ManifoldCF?
     ● Reliability
     ● Incremental
     ● Multi repositories
     ● Security model
     ● Monitoring
  24. Why ManifoldCF? - Reliability
     Job scheduling and configuration are stored in the database to maintain the state of all the executions.
     [Diagram: Repository -> Pull Agent Daemon -> Search Server; configuration and scheduling are kept in the Database]
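The reliability idea above, keeping job configuration and state in a database so that it survives a restart, can be sketched as follows. This is an assumption-laden illustration: it uses Python's built-in `sqlite3` module as a stand-in database, and the `jobs` table and the helper functions are hypothetical, not ManifoldCF's actual schema.

```python
import sqlite3
from typing import Optional, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the job store and create the (hypothetical) jobs table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "name TEXT PRIMARY KEY, schedule TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_job(conn: sqlite3.Connection, name: str, schedule: str,
             status: str = "inactive") -> None:
    """Insert the job, or overwrite its schedule and status if it already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO jobs (name, schedule, status) VALUES (?, ?, ?)",
        (name, schedule, status),
    )
    conn.commit()


def load_job(conn: sqlite3.Connection, name: str) -> Optional[Tuple[str, str, str]]:
    """Return (name, schedule, status) for the job, or None if it is unknown."""
    return conn.execute(
        "SELECT name, schedule, status FROM jobs WHERE name = ?", (name,)
    ).fetchone()
```

With a file path instead of `":memory:"`, a restarted process can call `load_job` and resume from the last recorded status rather than starting from scratch.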
  25. Why ManifoldCF? - Incremental
     Jobs can optionally be configured to re-visit contents incrementally.
     [Diagram: Repository with nodes N1, N2, N4 -> Apache ManifoldCF]
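One way to picture incremental re-visiting is to compare a version marker recorded at the previous crawl with the current one, and to process only what changed. The sketch below assumes each document exposes such a marker; the function name and the marker format are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_incremental_crawl(
    seen_versions: Dict[str, str],
    current_versions: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Return (ids to ingest again, ids to delete from the index).

    A document is ingested when it is new or its version marker changed,
    and deleted when it no longer exists in the repository.
    """
    to_ingest = sorted(
        doc_id
        for doc_id, version in current_versions.items()
        if seen_versions.get(doc_id) != version
    )
    to_delete = sorted(
        doc_id for doc_id in seen_versions if doc_id not in current_versions
    )
    return to_ingest, to_delete


previous = {"N1": "v1", "N2": "v1", "N3": "v1"}
current = {"N1": "v1", "N2": "v2", "N4": "v1"}
changed, removed = plan_incremental_crawl(previous, current)
```

In this example the unchanged document `N1` is skipped entirely, so the cost of a crawl follows the amount of change rather than the size of the repository.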
  26. Why ManifoldCF? - Multi repositories
     Jobs can retrieve contents from the following repositories:
     ● CMIS-compliant
     ● Alfresco
     ● IBM FileNet
     ● EMC Documentum
     ● Microsoft SharePoint
     ● OpenText LiveLink
     ● Autonomy Meridio
     ● Memex Patriarch
     ● Windows Share/DFS
     ● Generic JDBC
     ● Generic Filesystem
     ● Generic RSS and Web
  27. Why ManifoldCF? - Multi repositories
     Jobs can ingest contents into the following search servers:
     ● Apache Solr
     ● ElasticSearch
     ● OpenSearchServer
     ● MetaCarta GTS
  28. Why ManifoldCF? - Security model
     Retrieve per-content ACLs.
     [Diagram: Authorities 1-3 -> Authority Service (user access tokens); Repositories 1-3 -> Pull Agent Daemon (doc access tokens) -> Search Server (user specific search results)]
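The token-based security model can be illustrated with a deliberately simplified filter: a document is visible to a user when at least one of the user's access tokens matches one of the document's access tokens. The function name and the data layout are hypothetical, and any finer-grained rules a real deployment applies are left out.

```python
from typing import Dict, Iterable, List


def visible_documents(
    doc_tokens: Dict[str, List[str]],
    user_tokens: Iterable[str],
) -> List[str]:
    """Return the ids of the documents sharing at least one token with the user."""
    allowed = set(user_tokens)
    return [
        doc_id
        for doc_id, tokens in doc_tokens.items()
        if allowed.intersection(tokens)
    ]


# Document access tokens as they might be stored next to each indexed document.
indexed = {
    "policy.pdf": ["hr", "admin"],
    "roadmap.doc": ["engineering"],
    "menu.txt": ["everyone"],
}
```

Storing the document tokens in the index at crawl time means the search server only needs the user's tokens at query time to return user specific results.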
  29. Why ManifoldCF? - Monitoring
     The Crawler UI allows you to:
     ● configure jobs and connectors
     ● monitor job execution
     ● monitor content ingestion
        ○ status reports
           ■ document status
           ■ queue status
        ○ history reports
           ■ simple history
           ■ maximum activity
           ■ maximum bandwidth
           ■ result histogram
  30. Architecture
     ● Pull Agent Daemon
        ○ Jobs
           ■ Repository Connectors
           ■ Output Connectors
           ■ Authority Connectors
  31. Architecture
     ● Pull Agent Daemon (the core service)
        ○ Jobs (execute the ingestion tasks)
           ■ Repository Connectors (retrieve contents)
           ■ Output Connectors (ingest contents)
           ■ Authority Connectors (retrieve ACLs)
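The division of labour between the three connector types can be sketched with a few small interfaces. This follows the slide's simplified wording and is written in Python for brevity; every class and method name here is hypothetical and does not reproduce ManifoldCF's Java connector interfaces.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional, Tuple


class RepositoryConnector(ABC):
    @abstractmethod
    def fetch(self) -> Iterable[Tuple[str, str]]:
        """Yield (document id, content) pairs from the repository."""


class AuthorityConnector(ABC):
    @abstractmethod
    def acl_for(self, doc_id: str) -> List[str]:
        """Return the access tokens attached to a document."""


class OutputConnector(ABC):
    @abstractmethod
    def ingest(self, doc_id: str, content: str, acl: List[str]) -> None:
        """Send one document to the search server."""


class DictRepository(RepositoryConnector):
    """Toy repository backed by a dictionary."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self.documents = documents

    def fetch(self) -> Iterable[Tuple[str, str]]:
        return iter(self.documents.items())


class StaticAuthority(AuthorityConnector):
    """Toy authority with a fixed ACL table."""

    def __init__(self, acls: Dict[str, List[str]]) -> None:
        self.acls = acls

    def acl_for(self, doc_id: str) -> List[str]:
        return self.acls.get(doc_id, [])


class ListOutput(OutputConnector):
    """Toy search server that records what it receives."""

    def __init__(self) -> None:
        self.ingested: List[Tuple[str, str, List[str]]] = []

    def ingest(self, doc_id: str, content: str, acl: List[str]) -> None:
        self.ingested.append((doc_id, content, acl))


def run_job(repository: RepositoryConnector, output: OutputConnector,
            authority: Optional[AuthorityConnector] = None) -> int:
    """Move every document from the repository to the output; return the count."""
    count = 0
    for doc_id, content in repository.fetch():
        acl = authority.acl_for(doc_id) if authority is not None else []
        output.ingest(doc_id, content, acl)
        count += 1
    return count
```

Because `run_job` depends only on the abstract interfaces, adding support for a new repository or search server means writing one more connector class, without touching the job logic.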
  32. Architecture
     [Diagram: Repositories 1-3 -> Pull Agent Daemon -> Search Servers 1-3, together with the Authority Service and the Database]
  33. Architecture - Job
     A job is a unit of ingestion work that consists of:
        ○ verbal description
        ○ repository connection
           ■ authority connection (optional)
        ○ metadata mapping
        ○ output connection (search server)
        ○ crawling model
        ○ scheduling information (on demand or time ranges)
  34. Architecture - Job
     [Diagram: a Job between a Repository and a Search Server, with a Repository Connector (retrieve content, ACL), an Authority Connector (ACLs) and an Output Connector. Labels: query to retrieve contents, metadata mapping, verbal description, content ingestion, crawling model, scheduling]
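The parts of a job listed above map naturally onto a small configuration record. The sketch below is illustrative only: `JobDefinition`, its field names, the `"once"` crawling model value and the `map_metadata` helper are all hypothetical and do not mirror ManifoldCF's own job format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class JobDefinition:
    """Hypothetical record holding the parts of a job named on the slide."""
    description: str                                   # verbal description
    repository_connection: str                         # where contents come from
    output_connection: str                             # target search server
    authority_connection: Optional[str] = None         # optional
    metadata_mapping: Dict[str, str] = field(default_factory=dict)
    crawling_model: str = "once"                       # placeholder value
    schedule: Optional[str] = None                     # None means "on demand"


def map_metadata(job: JobDefinition, source: Dict[str, Any]) -> Dict[str, Any]:
    """Rename source metadata fields according to the job's mapping.

    Fields without a mapping entry keep their original name.
    """
    return {job.metadata_mapping.get(key, key): value for key, value in source.items()}


job = JobDefinition(
    description="Index the document library",
    repository_connection="cmis-repo",
    output_connection="solr",
    metadata_mapping={"cm:title": "title"},
)
mapped = map_metadata(job, {"cm:title": "Report", "size": 10})
```

Keeping the mapping in the job rather than in a connector lets the same repository connection feed differently shaped indexes.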
  35. The 0.3-incubating version
     ● CMIS Repository Connector
     ● OpenSearchServer Output Connector
     ● Scripting Language
     ● New Maven build process
     ● Several bug fixes
  36. The 0.4-incubating version
     ● Alfresco Connector
     ● JDBC Connector now supports MySQL
     ● CMIS Connector upgraded to OpenCMIS 0.5.0
     ● Several bug fixes
  37. The 0.5-incubating version
     ● Apache Velocity for connector UI templates
     ● ElasticSearch Output Connector
     ● CMIS Connector upgraded to OpenCMIS 0.6.0
     ● Prebuilt connector support: just add jars and go!
     ● New Japanese localization
     ● Several bug fixes
  38. The 0.6 version
     ● Project moved to JDK 1.6
     ● Many improvements for all the connectors
     ● Updated the Apache Solr plugin
     ● Several bug fixes
  39. What's new in the 1.0.1 version
     ● Microsoft SharePoint 2010 support
     ● JDBC Connector now manages metadata
     ● CMIS Connector upgraded to OpenCMIS 0.7.0
     ● Several bug fixes
  40. The book: ManifoldCF in Action
     ManifoldCF in Action, by Karl Wright, published by Manning.
     Karl is the original developer and the principal committer of Apache ManifoldCF.
     The book is available at the following site: http://www.manning.com/wright
  41. DEMO
  42. Resources
     Homepage: http://manifoldcf.apache.org
     Download page: http://manifoldcf.apache.org/en_US/download.html
  43. Thank you for your attention! ^__^
     http://www.open4dev.com
