2. Overview
● The story
● What is ManifoldCF?
● Why ManifoldCF?
● Architecture
● The 0.3-incubating version
● The 0.4-incubating version
● What's new in 0.5-incubating
● The book: ManifoldCF in Action
● Demo
● Resources
3. The story
The original ManifoldCF code base was granted by MetaCarta Inc.
to the Apache Software Foundation in December 2009.
The MetaCarta effort represented more than five years of successful
development and testing in multiple, challenging enterprise
environments.
The project entered the Apache Incubator because its community was
not yet diverse enough, but it is now moving towards graduation.
^__^
5. What is ManifoldCF?
● Open Source crawler
○ schedule jobs to create indexes
■ get contents from repositories
■ push contents on search servers
● Out of the box, it is distributed as J2EE web apps
○ REST API
○ Authority Service
○ Crawler UI
● Can be embedded in any Java application
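As a rough illustration of the REST API mentioned above, the sketch below builds an HTTP request that lists the defined jobs. The base URL (`http://localhost:8345/mcf-api-service/json`) and the `jobs` resource are assumptions about a default local deployment, and the class and method names are hypothetical; check them against your own installation before relying on them.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class McfApiSketch {
    // Assumption: base URL of the API web app in a default local deployment.
    static final String DEFAULT_BASE = "http://localhost:8345/mcf-api-service/json";

    // Joins the base URL and a resource path, tolerating a trailing slash on the base.
    static URI resource(String base, String path) {
        String trimmed = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        return URI.create(trimmed + "/" + path);
    }

    // Builds (but does not send) a GET request for the list of jobs.
    static HttpRequest listJobs(String base) {
        return HttpRequest.newBuilder(resource(base, "jobs"))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = listJobs(DEFAULT_BASE);
        // Only prints the request line; nothing is sent over the network here.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request against a running instance, it could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.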
7. Why ManifoldCF? - Reliability
Job scheduling and configuration are stored in the database
to maintain the state of every execution
8. Why ManifoldCF? - Incremental
Jobs can optionally be configured to revisit content
incrementally
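One common way to make a crawl incremental is to remember a version marker for each document and re-index only when it changes. The sketch below shows that idea in isolation; it is not ManifoldCF's actual implementation, and the class name, method name, and in-memory map are all hypothetical stand-ins (a real crawler would persist this state in its database).

```java
import java.util.HashMap;
import java.util.Map;

final class IncrementalCrawlSketch {
    // Last version seen per document identifier (hypothetical in-memory store).
    private final Map<String, String> lastSeen = new HashMap<>();

    // Records the current version and reports whether the document
    // is new or has changed since the previous visit.
    boolean needsIndexing(String documentId, String currentVersion) {
        String previous = lastSeen.put(documentId, currentVersion);
        return !currentVersion.equals(previous);
    }

    public static void main(String[] args) {
        IncrementalCrawlSketch crawl = new IncrementalCrawlSketch();
        System.out.println(crawl.needsIndexing("doc-1", "v1")); // true: first visit
        System.out.println(crawl.needsIndexing("doc-1", "v1")); // false: unchanged
        System.out.println(crawl.needsIndexing("doc-1", "v2")); // true: version changed
    }
}
```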
9. Why ManifoldCF? - Multi repositories
Jobs can retrieve contents from the following repositories:
● CMIS-compliant
● Alfresco
● IBM FileNet
● EMC Documentum
● Microsoft SharePoint
● OpenText LiveLink
● Autonomy Meridio
● Memex Patriarch
● Windows Share/DFS
● Generic JDBC
● Generic Filesystem
● Generic RSS and Web
10. Why ManifoldCF? - Multi search servers
Jobs can push content to the following search servers:
● ElasticSearch
● OpenSearchServer
● Apache Solr
● MetaCarta GTS
12. Why ManifoldCF? - Monitoring
The Crawler UI allows you to:
● configure jobs and connectors
● monitor job execution
● monitor content ingestion
○ status reports
■ document status
■ queue status
○ history reports
■ simple history
■ maximum activity
■ maximum bandwidth
■ result histogram
16. Architecture - Job
A job is a unit of ingestion work that consists of:
○ verbal description
○ repository connection
■ authority connection (optional)
○ metadata mapping
○ output connection (search server)
○ crawling model
○ scheduling information (on demand or time ranges)
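The elements of a job listed above can be pictured as a simple value object. The sketch below is only a reading aid: the class, field, and enum names are hypothetical and do not correspond to ManifoldCF's own classes. The authority connection is the only optional element, so it is the only field allowed to be null.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

final class JobSketch {
    // The two scheduling styles named on the slide.
    enum Schedule { ON_DEMAND, TIME_RANGES }

    final String description;
    final String repositoryConnection;
    private final String authorityConnection; // optional, may be null
    final Map<String, String> metadataMapping; // source field -> index field
    final String outputConnection;
    final String crawlingModel;
    final Schedule schedule;

    JobSketch(String description, String repositoryConnection, String authorityConnection,
              Map<String, String> metadataMapping, String outputConnection,
              String crawlingModel, Schedule schedule) {
        this.description = Objects.requireNonNull(description, "description");
        this.repositoryConnection = Objects.requireNonNull(repositoryConnection, "repositoryConnection");
        this.authorityConnection = authorityConnection;
        this.metadataMapping = Map.copyOf(metadataMapping);
        this.outputConnection = Objects.requireNonNull(outputConnection, "outputConnection");
        this.crawlingModel = Objects.requireNonNull(crawlingModel, "crawlingModel");
        this.schedule = Objects.requireNonNull(schedule, "schedule");
    }

    // Empty when the job has no authority connection.
    Optional<String> authorityConnection() {
        return Optional.ofNullable(authorityConnection);
    }

    public static void main(String[] args) {
        // All values below are made-up examples.
        JobSketch job = new JobSketch("Nightly document crawl", "example-repository", null,
                Map.of("title", "title"), "example-search-server", "example-model",
                Schedule.ON_DEMAND);
        System.out.println(job.description + " -> " + job.outputConnection
                + ", secured: " + job.authorityConnection().isPresent());
    }
}
```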
18. The 0.3-incubating version
● CMIS Repository Connector
● OpenSearchServer Output Connector
● Scripting Language
● New Maven build process
● Several bug fixes
19. The 0.4-incubating version
● Alfresco Connector
● JDBC Connector now supports MySQL
● CMIS Connector upgraded to OpenCMIS 0.5.0
● Several bug fixes
20. What's new in 0.5-incubating
● Apache Velocity for connectors UI templates
● ElasticSearch Output Connector
● CMIS Connector upgraded to OpenCMIS 0.6.0
● Prebuild connector support: just add jars and go!
● New Japanese localization
● Several bug fixes
21. The book: ManifoldCF in Action
ManifoldCF in Action
by Karl Wright
published by Manning
Karl is the original developer and the
principal committer of Apache ManifoldCF
The book is available at the following site:
http://www.manning.com/wright