Apache ManifoldCF

An overview of Apache ManifoldCF, the open source crawler that lets you configure jobs to build and maintain search indexes from content held in repositories.

Overview

● The story
● What is ManifoldCF?
● Why ManifoldCF?
● Architecture
● The 0.3-incubating version
● The 0.4-incubating version
● What's new in the 0.5-incubating version
● The book: ManifoldCF in Action
● Demo
● Resources

The story

The original ManifoldCF code base was granted by MetaCarta Inc.
to the Apache Software Foundation in December 2009.

The MetaCarta effort represented more than five years of successful
development and testing in multiple, challenging enterprise
environments.

The project entered the Apache Incubator because its community was
not yet diverse enough; it is now moving towards graduation.
^__^

What is ManifoldCF?
● Open Source crawler
○ schedule jobs to create indexes
■ get contents from repositories
■ push contents to search servers

● Out of the box, it is distributed as J2EE web apps
○ REST API
○ Authority Service
○ Crawler UI

● Can be embedded in any Java application

Why ManifoldCF?
● Reliability
● Incremental
● Multi repositories
● Security model
● Monitoring

Why ManifoldCF? - Reliability

Job scheduling and configuration are stored in the database,
which maintains the state of all executions
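
The idea of keeping job state in a database can be illustrated with a minimal Python sketch. It is not ManifoldCF's schema or code: the table name `job_state`, its columns, and the helper functions are hypothetical, and SQLite only stands in for whatever database a deployment actually uses.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical job-state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_state ("
        " job_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " last_document TEXT)"
    )
    return conn


def save_state(conn, job_id, status, last_document):
    """Record the job's progress so that a restart can pick it up."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO job_state VALUES (?, ?, ?)",
            (job_id, status, last_document),
        )


def load_state(conn, job_id):
    """Return (status, last_document), or None for an unknown job."""
    return conn.execute(
        "SELECT status, last_document FROM job_state WHERE job_id = ?",
        (job_id,),
    ).fetchone()


conn = open_store()
save_state(conn, "nightly-crawl", "running", "doc-0042")
state = load_state(conn, "nightly-crawl")
```

Because the state lives outside the crawling process, a process that stops and starts again can call `load_state` and continue from the recorded point instead of starting over.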

Why ManifoldCF? - Incremental

Jobs can optionally be configured to revisit contents
incrementally
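
A generic way to picture incremental revisiting is to compare each document's current version against the version recorded at the previous crawl. The Python sketch below works under that assumption; the slides do not describe ManifoldCF's actual mechanism, and the function name and the dictionary-based repository are hypothetical.

```python
def incremental_crawl(repository, seen_versions):
    """Work out what changed since the last crawl and update seen_versions.

    repository:    dict mapping document ID -> current version string
    seen_versions: dict mapping document ID -> version indexed last time
    Returns (changed_ids, deleted_ids), both sorted.
    """
    # New or modified documents have no recorded version or a different one.
    changed = [doc_id for doc_id, version in repository.items()
               if seen_versions.get(doc_id) != version]
    # Documents recorded earlier but no longer present were deleted.
    deleted = [doc_id for doc_id in seen_versions
               if doc_id not in repository]
    for doc_id in changed:
        seen_versions[doc_id] = repository[doc_id]
    for doc_id in deleted:
        del seen_versions[doc_id]
    return sorted(changed), sorted(deleted)


seen = {}
first = incremental_crawl({"a": "v1", "b": "v1"}, seen)
second = incremental_crawl({"a": "v2", "c": "v1"}, seen)
```

The first run indexes everything; the second touches only the modified document, the new one, and the deletion, which is what makes repeated crawls cheap.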

Why ManifoldCF? - Multi repositories

Jobs can retrieve contents from the following repositories:
● CMIS-compliant
● Alfresco
● IBM FileNet
● EMC Documentum
● Microsoft SharePoint
● OpenText LiveLink
● Autonomy Meridio
● Memex Patriarch
● Windows Share/DFS
● Generic JDBC
● Generic Filesystem
● Generic RSS and Web

Why ManifoldCF? - Multi repositories

Jobs can send contents to the following search servers:
● ElasticSearch
● OpenSearchServer
● Apache Solr
● MetaCarta GTS

Why ManifoldCF? - Security model

Retrieves the ACLs (access control lists) attached to each content item
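
To show why per-content ACLs matter at search time, here is a minimal Python sketch of filtering results by access tokens. The allow/deny layout, the token strings, and the `can_read` function are assumptions made for illustration; they are not taken from the slides or from ManifoldCF's API.

```python
def can_read(document_acl, user_tokens):
    """Return True if the user may see the document.

    document_acl: dict with optional "allow" and "deny" token lists
    user_tokens:  tokens describing the user (for example group names)
    A matching deny token always wins over an allow token.
    """
    allow = set(document_acl.get("allow", []))
    deny = set(document_acl.get("deny", []))
    tokens = set(user_tokens)
    if tokens & deny:
        return False
    return bool(tokens & allow)


# Hypothetical index entries, each carrying the ACL retrieved at crawl time.
index = {
    "budget.xls": {"allow": ["finance"], "deny": ["contractors"]},
    "handbook.pdf": {"allow": ["staff", "finance"]},
}
visible = sorted(doc for doc, acl in index.items()
                 if can_read(acl, ["staff"]))
```

Storing the ACL next to each indexed document lets the search side apply the repository's permissions without querying the repository again for every result.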

Why ManifoldCF? - Monitoring

The Crawler UI allows you to:
● configure jobs and connectors
● monitor job execution
● monitor content ingestion
○ status reports
■ document status
■ queue status
○ history reports
■ simple history
■ maximum activity
■ maximum bandwidth
■ result histogram

Architecture

● Pull Agent Daemon (the core service)
○ Jobs (execute the ingestion tasks)
■ Repository Connectors (retrieve contents)
■ Output Connectors (ingest contents)
■ Authority Connectors (retrieve ACLs)
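
The three connector roles can be pictured as a small pipeline. The Python sketch below only mirrors the labels used on the slide; all class and method names are hypothetical, and the real division of responsibilities inside ManifoldCF may differ.

```python
class DictRepositoryConnector:
    """Hypothetical repository connector: retrieves contents."""

    def __init__(self, documents):
        self.documents = documents

    def fetch(self):
        yield from self.documents.items()


class StaticAuthorityConnector:
    """Hypothetical authority connector: retrieves ACLs."""

    def __init__(self, acls):
        self.acls = acls

    def acl_for(self, doc_id):
        return self.acls.get(doc_id, [])


class InMemoryOutputConnector:
    """Hypothetical output connector: ingests contents into an index."""

    def __init__(self):
        self.index = {}

    def ingest(self, doc_id, content, acl):
        self.index[doc_id] = {"content": content, "acl": acl}


def run_job(repository, authority, output):
    """Move every document from the repository to the output, with its ACL."""
    count = 0
    for doc_id, content in repository.fetch():
        output.ingest(doc_id, content, authority.acl_for(doc_id))
        count += 1
    return count


output = InMemoryOutputConnector()
ingested = run_job(
    DictRepositoryConnector({"doc-1": "hello", "doc-2": "world"}),
    StaticAuthorityConnector({"doc-1": ["staff"]}),
    output,
)
```

Keeping each role behind its own small interface is what allows any repository to be paired with any search server.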

Architecture - Job

A job is a unit of ingestion work that consists of:
○ verbal description
○ repository connection
■ authority connection (optional)
○ metadata mapping
○ output connection (search server)
○ crawling model
○ scheduling information (on demand or time ranges)
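
The parts of a job listed above map naturally onto a simple record. This Python sketch is one hypothetical way to model it: the field names follow the bullets, while the class itself and every example value (connection names, the crawling model string, the metadata mapping) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Job:
    description: str                  # verbal description
    repository_connection: str        # where contents come from
    output_connection: str            # search server receiving contents
    crawling_model: str               # placeholder; real values not shown here
    schedule: str = "on demand"       # or time ranges
    authority_connection: Optional[str] = None   # optional, per the slide
    metadata_mapping: Dict[str, str] = field(default_factory=dict)


job = Job(
    description="Index the intranet documents",
    repository_connection="example-repository",
    output_connection="example-search-server",
    crawling_model="example-model",
    metadata_mapping={"source_field": "index_field"},
)
```

Only the authority connection is marked optional on the slide, so it defaults to `None` here; the schedule defaults to on-demand execution.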

The 0.3-incubating version

● CMIS Repository Connector
● OpenSearchServer Output Connector
● Scripting Language
● New Maven build process
● Several bug fixes

The 0.4-incubating version

● Alfresco Connector
● JDBC Connector now supports MySQL
● CMIS Connector upgraded to OpenCMIS 0.5.0
● Several bug fixes

What's new in the 0.5-incubating version

● Apache Velocity for connectors UI templates
● ElasticSearch Output Connector
● CMIS Connector upgraded to OpenCMIS 0.6.0
● Prebuild connector support: just add jars and go!
● New Japanese localization
● Several bug fixes

The book: ManifoldCF in Action

ManifoldCF in Action
by Karl Wright
published by Manning

Karl is the original developer and the
principal committer of Apache ManifoldCF

The book is available at the following site:
http://www.manning.com/wright

Resources

Homepage:
http://incubator.apache.org/connectors

Download page:
http://incubator.apache.org/connectors/download.html
