Apache ManifoldCF
About me
● Open Source ECM Specialist at Sourcesence

● Author and Technical Reviewer at Packt Publishing
   ○ Alfresco 3 Web Services (2010)
   ○ GateIn Cookbook (2012)

● Alfresco Community (nickname OpenPj)
   ○ Alfresco Wiki Gardener
   ○ Top 10 supporter (English and Italian)
   ○ Moderator of the Italian forum

● PMC Member at the Apache Software Foundation

● JBoss Community
   ○ Content editor for jboss.org
   ○ Project Leader and Committer for PortletSwap
Overview
● The story
● What is ManifoldCF?
   ○ What is a repository?
   ○ What is a search server?
● Why ManifoldCF?
● Architecture
● The growing path
   ○ The 0.3-incubating version
   ○ The 0.4-incubating version
   ○ The 0.5-incubating version
   ○ The 0.6 version (graduated ^__^)
   ○ What's new in the 1.0.1 version
● The book: ManifoldCF in Action
● Demo
● Resources
The story

The original ManifoldCF code base was granted by MetaCarta, Inc.
to the Apache Software Foundation in December 2009.

The MetaCarta effort represented more than five years of
successful development and testing in multiple, challenging
enterprise environments.

The project graduated to an Apache Top-Level Project in July
2012.

                               ^__^
What is ManifoldCF?
● Open Source crawler
  ○ schedule jobs to create indexes
     ■ get contents from repositories
     ■ push contents to search servers
[Diagram: Repository 1, 2, 3 → Apache ManifoldCF → Search Server 1, 2, 3]
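
The pull-and-push flow on this slide can be sketched in a few lines. This is only a toy illustration, not ManifoldCF code: the `run_job` function and the in-memory repository and search server are hypothetical stand-ins.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins: a repository maps document IDs to content,
# and a search server is a list of indexed (id, content) pairs.
Repository = Dict[str, str]
SearchServer = List[Tuple[str, str]]


def run_job(repository: Repository, search_server: SearchServer) -> int:
    """Pull every document from the repository and push it to the search server."""
    pushed = 0
    for doc_id, content in sorted(repository.items()):
        search_server.append((doc_id, content))
        pushed += 1
    return pushed


repo_1: Repository = {"doc-a": "hello", "doc-b": "world"}
server_1: SearchServer = []
pushed_count = run_job(repo_1, server_1)
print(pushed_count, server_1)
```

A real crawler adds scheduling, retries, and state tracking around this loop.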
What is ManifoldCF?
● Open Source crawler
  ○ schedule jobs to create indexes
     ■ get contents from repositories
     ■ push contents to search servers

● Out of the box, it is distributed as J2EE web apps
  ○ REST API
  ○ Authority Service
  ○ Crawler UI

● Can be embedded in any Java application
What is a repository?
● Open Source crawler
  ○ schedule jobs to create indexes
     ■ get contents from repositories
     ■ push contents to search servers
[Diagram: Repository 1, 2, 3 → Apache ManifoldCF → Search Server 1, 2, 3]
What is a repository?
● a central place to put and get contents
● contents are kept in an organized way
    ○ ER model is the old way
    ○ Node graph
        ■ properties (metadata)
        ■ associations
        ■ renditions
● base component of Enterprise Content Management (ECM)
  systems
● from the Latin repositorium
    ○ table of service
    ○ vessel
    ○ chamber
    ○ where to keep and find your things!!!
Enterprise Content Management
Enterprise content management (ECM) is a formalized means of
organizing and storing an organization's documents, and other
content, that relate to the organization's processes. The term
encompasses strategies, methods, and tools used throughout the
lifecycle of the content.

                                                                    Wikipedia
                       http://en.wikipedia.org/wiki/Enterprise_content_management
Enterprise Content Management
[Diagram: ECM layers: Enterprise services; Workflows and processes; Repository, Users + groups (LDAP, IDM), BPM]
What is a repository? - You use it!!!
● Some simple examples:
   ○ SMTP servers
   ○ Google Drive
   ○ Dropbox

● Some Open Source repository implementations:
   ○ eXo JCR
   ○ Apache Jackrabbit

● Some Open Source ECM systems for critical usage:
   ○ Alfresco
   ○ Nuxeo
   ○ Hippo
What is a repository? - Decoration
[Diagram: metadata is applied to content entering the Repository through CMIS, JCR, REST, SOAP, IMAP, email, and FTP. Content is retrieved using metadata. Query languages: CMIS, JCR SQL, XPath, Lucene, full text (Google style). The Repository is backed by Indexes.]
What is a repository? - Architecture

[Diagram: layers from top to bottom: APIs (CMIS, REST, FTP, WebDAV, IMAP); Model; Storage; Content Store and Indexes]
What is a repository? - Repo Model
 ● a different point of view on how to manage data
    ○ no more relational databases (ER)
 ● repositories offer you an API!
 ● based on the JCR Repository Model (JSR-283)
    ○ workspaces
    ○ identifiers
    ○ users
    ○ nodes and node types (contents)
       ■ properties and property types
       ■ associations (shared nodes)
What is a repository? - Repo Model
 ● A node is a generic piece of content stored in a repository
    ○ type
    ○ properties
    ○ associations
    ○ binary streams (optional)
       ■ renditions
          ■ text document
          ■ Video
          ■ Image
          ■. . .
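
The node structure listed above can be sketched as a small data class. The class and field names are illustrative only; they are not taken from the JCR API or from any specific repository product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A generic piece of content; all field names here are illustrative."""
    node_type: str
    properties: Dict[str, object] = field(default_factory=dict)   # metadata
    associations: List["Node"] = field(default_factory=list)      # links to other nodes
    renditions: Dict[str, bytes] = field(default_factory=dict)    # optional binary streams


report = Node(
    node_type="document",
    properties={"name": "report.pdf", "mimetype": "application/pdf", "tags": ["finance"]},
)
# A rendition is an alternative binary form of the same content.
report.renditions["text"] = b"plain text extracted from the PDF"

folder = Node(node_type="folder", properties={"name": "Reports"}, associations=[report])
print(folder.associations[0].properties["name"])
```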
What is a repository? - Repo Model
[Diagram: a Node has a Type, Properties (metadata: name, description, mimetype, tags, categories), and Renditions (Binary 1, Binary 2, Binary 3)]
What is a repository? - Repo Model
[Diagram: a Repository contains Workspaces 1, 2, and 3; a workspace has a Root node with a tree of child nodes A, B, C, D, E, G]
Why use a repository?
 ● adding a new node type only means adding configuration
 ● you can scale out easily
 ● storing very large amounts of data
 ● storing simple data structures, such as simple JSON
   documents
 ● looking up data by keys rather than using queries
 ● searching for data based upon relevance rather than criteria
 ● evolving schemas and/or data structures
 ● caching data in-memory for performance
 ● giving up consistency guarantees for increased availability
Why use a repository?
 ● Standard API
    ○ Content Management Interoperability Services (CMIS)
    ○ Java Content Repository (JCR)
 ● Hierarchical structure
 ● Transaction support
 ● Versioning
 ● Locking
 ● Observation
 ● References
 ● Navigation services
    ○ parents
    ○ children
    ○ associated
 ● Search services
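
The hierarchical structure and the parent/child navigation services can be illustrated with a minimal tree. `TreeNode` and its `path` method are hypothetical names for this sketch, not part of CMIS or JCR.

```python
from typing import List, Optional


class TreeNode:
    """A node that knows its parent and its children (illustrative only)."""

    def __init__(self, name: str, parent: Optional["TreeNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["TreeNode"] = []
        if parent is not None:
            parent.children.append(self)

    def path(self) -> str:
        """Walk up through the parents to build an absolute path."""
        parts: List[str] = []
        node: Optional["TreeNode"] = self
        while node is not None and node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


root = TreeNode("")          # the root node has no parent
node_a = TreeNode("A", root)
node_d = TreeNode("D", node_a)
print(node_d.path())
```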
What is a search server?
● Open Source crawler
  ○ schedule jobs to create indexes
     ■ get contents from repositories
     ■ push contents to search servers
[Diagram: Repository 1, 2, 3 → Apache ManifoldCF → Search Server 1, 2, 3]
What is a search server?
A search server is an application that allows users to
find repository contents quickly using:

●   keywords (full text search)
●   content fields
●   tags
●   categories
●   ranking

The information it keeps is stored as indexes.
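
To make the idea of an index concrete, here is a minimal sketch of an inverted index for keyword search. It is a deliberate simplification and does not reflect how Apache Solr or any other search server is implemented; `build_index` and `search` are names made up for this example.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased word to the set of document IDs containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the IDs of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {"d1": "Apache ManifoldCF crawler", "d2": "Apache Solr search server"}
word_index = build_index(docs)
print(search(word_index, "apache solr"))
```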
What is a search server?


[Diagram: layers from top to bottom: REST API; Storage; Indexes]
Why ManifoldCF?
● Reliability
● Incremental
● Multi repositories
● Security model
● Monitoring
Why ManifoldCF? - Reliability
Job scheduling and configuration are stored in the
database to maintain the state of all executions

[Diagram: Repository → Pull Agent Daemon → Search Server; the Pull Agent Daemon keeps configuration and scheduling in the Database]
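
A rough sketch of why keeping job state in a database helps: a crawl that is interrupted can resume from the last recorded position. SQLite is used here only as a convenient stand-in, and the table and column names are invented for this example; they are not ManifoldCF's schema.

```python
import sqlite3

# In-memory database as a stand-in for the crawler's real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_state (job_id TEXT PRIMARY KEY, status TEXT, last_doc TEXT)"
)


def save_state(job_id: str, status: str, last_doc: str) -> None:
    """Record the latest known state of a job, replacing any previous row."""
    conn.execute(
        "INSERT OR REPLACE INTO job_state VALUES (?, ?, ?)",
        (job_id, status, last_doc),
    )
    conn.commit()


def load_state(job_id: str):
    """Return (status, last_doc) for a job, or None if the job is unknown."""
    return conn.execute(
        "SELECT status, last_doc FROM job_state WHERE job_id = ?", (job_id,)
    ).fetchone()


save_state("job-1", "running", "doc-41")
save_state("job-1", "running", "doc-42")
print(load_state("job-1"))
```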
Why ManifoldCF? - Incremental

Jobs can be optionally configured to re-visit contents
incrementally
[Diagram: Apache ManifoldCF re-visits nodes N1, N2, and N4 in the Repository]
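
One common way to crawl incrementally is to remember a version marker per document and re-process only what changed. The sketch below uses a content hash as that marker; this is an assumption made for illustration, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def version_of(content: str) -> str:
    """Derive a version marker from the content (a hash, for this sketch)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_documents(repository: Dict[str, str], seen: Dict[str, str]) -> List[str]:
    """Return IDs of new or modified documents and remember their versions."""
    changed = []
    for doc_id, content in sorted(repository.items()):
        version = version_of(content)
        if seen.get(doc_id) != version:
            changed.append(doc_id)
            seen[doc_id] = version
    return changed


node_repo = {"N1": "first", "N2": "first", "N4": "first"}
seen_versions: Dict[str, str] = {}
first_pass = changed_documents(node_repo, seen_versions)   # everything is new
node_repo["N2"] = "edited"
second_pass = changed_documents(node_repo, seen_versions)  # only N2 changed
print(first_pass, second_pass)
```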
Why ManifoldCF? - Multi repositories
Jobs can retrieve contents from the following repositories:
 ● CMIS-compliant
 ● Alfresco
 ● IBM FileNet
 ● EMC Documentum
 ● Microsoft SharePoint
 ● OpenText LiveLink
 ● Autonomy Meridio
 ● Memex Patriarch
 ● Windows Share/DFS
 ● Generic JDBC
 ● Generic Filesystem
 ● Generic RSS and Web
Why ManifoldCF? - Multi repositories
Jobs can ingest contents to the following search
servers:
● Apache Solr
● ElasticSearch
● OpenSearchServer
● MetaCarta GTS
Why ManifoldCF? - Security model
Retrieve per-content ACLs

[Diagram: the Pull Agent Daemon reads Repositories 1, 2, 3 and sends doc access tokens to the Search Server; the Authority Service consults Authorities 1, 2, 3 and supplies user access tokens; the Search Server returns user-specific search results]
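
The token matching behind user-specific results can be sketched as a set intersection: a document is visible when the user holds at least one of its access tokens. The token strings and the `visible_results` helper are invented for this example, and real deployments may also handle deny tokens.

```python
from typing import Dict, List, Set

# Document access tokens as they might be stored alongside each indexed document.
indexed_docs: Dict[str, Set[str]] = {
    "budget.xls": {"group:finance"},
    "handbook.pdf": {"group:everyone"},
    "salaries.doc": {"group:hr"},
}


def visible_results(hits: List[str], user_tokens: Set[str]) -> List[str]:
    """Keep only the hits whose access tokens overlap the user's tokens."""
    return [doc for doc in hits if indexed_docs[doc] & user_tokens]


# User access tokens as an authority might return them for one user.
alice_tokens = {"group:everyone", "group:finance"}
alice_results = visible_results(
    ["budget.xls", "handbook.pdf", "salaries.doc"], alice_tokens
)
print(alice_results)
```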
Why ManifoldCF? - Monitoring
The Crawler UI allows you to:
 ● configure jobs and connectors
 ● monitor job execution
 ● monitor content ingestion
   ○ status reports
      ■ document status
      ■ queue status
   ○ history reports
      ■ simple history
      ■ maximum activity
      ■ maximum bandwidth
      ■ result histogram
Architecture

● Pull Agent Daemon
   ○ Jobs
      ■ Repository Connectors
      ■ Output Connectors
      ■ Authority Connectors
Architecture

● Pull Agent Daemon (the core service)
   ○ Jobs (execute the ingestion tasks)
      ■ Repository Connectors (retrieve contents)
      ■ Output Connectors (ingest contents)
      ■ Authority Connectors (retrieve ACLs)
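
The three connector roles can be sketched as abstract interfaces. ManifoldCF itself is written in Java and its actual connector interfaces differ; the class and method names below are hypothetical and only mirror the responsibilities named on this slide.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Tuple


class RepositoryConnector(ABC):
    @abstractmethod
    def fetch(self) -> Iterable[Tuple[str, str]]:
        """Retrieve (document ID, content) pairs from a repository."""


class OutputConnector(ABC):
    @abstractmethod
    def ingest(self, doc_id: str, content: str) -> None:
        """Send one document to a search server."""


class AuthorityConnector(ABC):
    @abstractmethod
    def access_tokens(self, user: str) -> List[str]:
        """Return the access tokens held by a user."""


class DictRepository(RepositoryConnector):
    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = docs

    def fetch(self) -> Iterable[Tuple[str, str]]:
        return sorted(self.docs.items())


class ListOutput(OutputConnector):
    def __init__(self) -> None:
        self.ingested: List[Tuple[str, str]] = []

    def ingest(self, doc_id: str, content: str) -> None:
        self.ingested.append((doc_id, content))


class StaticAuthority(AuthorityConnector):
    def access_tokens(self, user: str) -> List[str]:
        return ["group:everyone"]


source = DictRepository({"a": "alpha", "b": "beta"})
target = ListOutput()
for doc_id, content in source.fetch():
    target.ingest(doc_id, content)
print(target.ingested)
```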
Architecture
[Diagram: Repository 1, 2, 3 → Pull Agent Daemon → Search Server 1, 2, 3; the Authority Service is shown above the Pull Agent Daemon and the Database below it]
Architecture - Job

A job is an ingestion task that consists of:
    ○ verbal description
    ○ repository connection
       ■ authority connection (optional)
    ○ metadata mapping
    ○ output connection (search server)
    ○ crawling model
    ○ scheduling information (on demand or time ranges)
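
The parts of a job listed above can be gathered into one record. This is a sketch with invented field names and values, not ManifoldCF's actual job format; the connection names, the `cm:title` property, and the crawling model string are examples only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JobDefinition:
    """Illustrative job record; every field name here is an assumption."""
    description: str
    repository_connection: str
    output_connection: str
    authority_connection: Optional[str] = None           # optional, as on the slide
    metadata_mapping: Dict[str, str] = field(default_factory=dict)
    crawling_model: str = "scan once"                    # free-text placeholder
    schedule: Optional[str] = None                       # None means "on demand"

    def map_metadata(self, source: Dict[str, str]) -> Dict[str, str]:
        """Rename repository metadata fields to search server field names."""
        return {self.metadata_mapping.get(k, k): v for k, v in source.items()}


job = JobDefinition(
    description="Index repository documents into the search server",
    repository_connection="my-repository",
    output_connection="my-search-server",
    metadata_mapping={"cm:title": "title"},
)
mapped = job.map_metadata({"cm:title": "Report", "author": "pj"})
print(mapped)
```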
Architecture - Job
[Diagram: the Repository Connector runs a query to retrieve contents and their ACLs from the Repository; the Authority Connector handles ACLs; the Job holds the verbal description, crawling model, and scheduling; the Output Connector performs metadata mapping and content ingestion into the Search Server]
The 0.3-incubating version

● CMIS Repository Connector
● OpenSearchServer Output Connector
● Scripting Language
● New Maven build process
● Several bug fixes
The 0.4-incubating version

● Alfresco Connector
● JDBC Connector now supports MySQL
● CMIS Connector upgraded to OpenCMIS 0.5.0
● Several bug fixes
The 0.5-incubating version

● Apache Velocity for connectors UI templates
● ElasticSearch Output Connector
● CMIS Connector upgraded to OpenCMIS 0.6.0
● Prebuilt connector support: just add jars and go!
● New Japanese localization
● Several bug fixes
The 0.6 version
●   Project moved to JDK 1.6
●   Many improvements for all the connectors
●   Updated the Apache Solr plugin
●   Several bug fixes
What's new in the 1.0.1 version
●   Microsoft SharePoint 2010 support
●   JDBC Connector now manages metadata
●   CMIS Connector upgraded to OpenCMIS 0.7.0
●   Several bug fixes
The book: ManifoldCF in Action

ManifoldCF in Action
by Karl Wright
published by Manning


Karl is the original developer and the
principal committer of Apache ManifoldCF


The book is available at the following site:
http://www.manning.com/wright
DEMO
Resources

Homepage:
http://manifoldcf.apache.org



Download page:
http://manifoldcf.apache.org/en_US/download.html
Thank you for your
       attention!
          ^__^

http://www.open4dev.com

Apache ManifoldCF @ Linux Day 2012
