CIGNEX Datamatics Webinar-Managing a Billion Object Repository-VER23JAN'14

Content is being produced at an ever-increasing rate, in high volume and at high velocity. Industry leaders are looking for a proven, reliable enterprise solution that can serve as a cornerstone for the future. They need an innovative repository that can easily store, retrieve, organize, and manage various types of content and records, that scales, and that enables faster processing with ease of use.

Learn how Alfresco, as a Big Content platform, helped build that repository solution for a welfare organization in the United States.

Published in: Technology

  1. 1. Managing a Billion Object Repository, Powered by Alfresco. Date: Jan 23, 2014. Presenter: Rajesh Avatani, Alfresco Practice Lead, CIGNEX Datamatics
  2. 2. About Presenter • Alfresco Practice Lead at CIGNEX Datamatics • 10+ years of industry experience • Co-author of Alfresco 4 ECM Implementation Book (2013). CIGNEX Datamatics Confidential, www.cignex.com
  3. 3. Agenda • Introduction • Use Case: Social Security e-Benefits System • Need for ‘Big Content’ • Solutions Evaluated • Alfresco as Big Content Platform • About CIGNEX Datamatics • Q&A
  4. 4. Introduction: Content Continues to Grow
      • 72 hours of video uploaded per minute (Source: YouTube)
      • 27 million pieces of content shared every day (Source: AOL & Nielsen)
      • 300 million photos uploaded daily on Facebook, on average (Source: Gizmodo)
      • 500 million tweets sent per day (Source: sec.gov)
      • 4 trillion paper documents in the U.S. alone (Source: Coopers and Lybrand)
  5. 5. ECM Requirements to Manage Growing Content: Search, Scalability, Collaboration, Compliance, Version Control, Open Architecture, Workflow Management
  6. 6. Managing a Billion Object Repository. Case Study: e-Benefits System for a Government Agency
  7. 7. Client Overview: US-based consortium managing public assistance and welfare programs for 15+ counties
  8. 8. Benefits Program Coverage: Employment Services, Cash Assistance, Insurance, My Benefits, Food Stamps, Childcare, Healthcare, Housing
  9. 9. Objectives: Transform the way citizen correspondence is stored, managed, indexed, retrieved, and archived
      • Faster Search and Retrieval
      • Metadata Management
      • Versions of Correspondence Templates
      • Scalable Centralized Document Repository
      • Secure Access
      • One-time Migration of Existing Docs
      • Legacy ECM Modernization
  10. 10. Scalability Requirements
      • Million objects, grows to a billion
      • 60 TB repository size
      • 30K administrative users
      • 200K / hour document ingestion rate
      • 500 / 2 sec search (6-month data range)
      • 25K / day PCL to PDF conversion
  11. 11. The Need for Big Content (Unstructured Big Data)
  12. 12. Types of Big Content: Records & Documents; Web Logs, Emails; Social Media Postings; Audio & Video Files; Blogs & Comments
  13. 13. Big Content Needs More Metadata. Diagram layers: Application Metadata; Domain Specific Metadata (Brand, Product, Department); Core Enterprise Metadata Framework (Elements Applicable to All Enterprise Content). Source: Gartner, 15 May 2013
  14. 14. Enterprise Search is the Key • Search provides a ready entry into the Big Content • Data-centric vendors acquired search companies • Alfresco => Apache Solr (“SolrCloud” in future?) Source: Gartner, 13 May 2013
  15. 15. Big Content Discovery & Analysis. Diagram labels: Search Engine; Users; Discovery Level; Analysis Level; Fuzzy Matching Mechanism; Indexing
  16. 16. Solutions Evaluated
  17. 17. Solutions Evaluated (continued)
      Lily = Hadoop HBase + Solr
      • Pros: Highly scalable; API based access; Few content services such as versioning; Separation of storage
      • Cons: Queuing system is not robust; Performance issues; Secure (group) access requires heavy customization
      MongoDB + Solr + Liferay
      • Pros: Highly scalable; High performance; API based access
      • Cons: Secure (group) access requires heavy customization; Not a traditional ECM install & configuration; Content services missing, such as versioning, workflow, business rules
  18. 18. Solutions Evaluated (continued)
      Alfresco + SolrCloud + DPE
      • Pros: Highly scalable; High performance; Secure access; Separation of storage; Content services; API based access
      • Cons: Need to programmatically maintain index/repository consistency; Custom “Data Processing Engine” requires support
  19. 19. Alfresco as Big Content Platform. Alfresco is now the largest open source content management company in the world: 4 billion files | 7 million users | 3000 customers | 180 countries
  20. 20. Architecture. Diagram labels (in extraction order): Data Processing Engine; Workload Scheduler; Legacy System; 15,000+ Docs/Day; 200,000+ Ingestion Rate; Secure & Flexible; Ingestion/night; 25/second Solr Search; Content Repository; Various Documents
  21. 21. Software
      • Operating System – Ubuntu Server 12.04.2
      • File Storage – Veritas Cluster File System
      • ECM – Alfresco EE version 4.1.4
      • Database – Oracle RAC 11g 11.2.0.3
      • Search – SolrCloud (Apache Solr version 4.3.1 and Apache ZooKeeper 3.4.5)
      • Application Server – Node.js (event driven, non-blocking I/O model for data intensive real-time applications that run across distributed devices)
      • PCL to PDF converter – PageTech
      • ESB – Oracle Service Bus
  22. 22. Data Processing Engine (DPE)
      • Central controller/broker
      • Document ingestion in Alfresco, including pre-processing, splitting, and metadata extraction
      • Brokering index updates: receiving and queuing real-time content updates from the ECM, which are pushed at a later stage to the SolrCloud index
  23. 23. Integration
      Custom SolrCloud Integration
      • Highly scalable (production use cases of 3+ billion documents on such a setup)
      • Date range based sharding policy can be implemented
      • Can have multiple Alfresco repositories using the same SolrCloud instance
      Physical Storage of Files
      • Physical storage (file system) to be isolated per county as per compliance requirements
      • Configured “Alfresco Content Store Selector” for each county
      • County ID (key) is the metadata
      Veritas Cluster File System
      • Performance Scaling – Cluster with multi-core processors, large memories, multiple high-performance gigabit Ethernet interfaces for client access
      • File System Scaling – Supports individual file systems of up to 256 terabytes capacity and up to a billion files per file system, with no practical limit on the number of file systems hosted by a cluster
  24. 24. Deployment
      Diagram labels: Data Processing Engine & PCL to PDF Connector; Master; Slave; Alfresco Cluster; Solr Cluster; ZooKeeper Cluster; Oracle Database Cluster; SAN Storage
      • Servers = 9: 2 Alfresco, 4 Solr (2 Solr, 2 ZooKeeper), 2 DPE, 1 PDF Converter
      • Memory: 120 GB RAM
      • CPU: 72 cores
      • Storage: 16 TB file system storage per annum
  25. 25. Solution Benefits
      • Scalable Platform – Designed for horizontal scaling, to scale for a billion objects
      • Performance – Document ingestion rate = 200K / hour; Average metadata search = 300 / sec; Time to search and retrieve PDF = 2770 / hour; Time to search and retrieve PCL = 6920 / hour
      • Modular Architecture – to replace or integrate new components or applications
  26. 26. Conclusion: Big Content = Platform
  27. 27. About CIGNEX Datamatics: A subsidiary of Datamatics Global Services Limited
  28. 28. Who We Are?
      • Since 2000, delivering solutions using Open Source technologies to: address business goals; increase business velocity; lower the cost of doing business; gain competitive advantage
      • Dramatically reduce Total Cost of Ownership (TCO) & deployment time of IT solutions
      • Portal Solutions | Content Solutions | Big Data Analytics Solutions
      • 400+ Implementations | 450+ Experts | 200+ Integrations | 13 Books | 5000+ Community Contributions
      • Offices: America | India | UK | Europe | Singapore | Australia
  29. 29. Where We Can Help Clients SOLUTIONS Portals Content Big Data Analytics User eXperience Platform Enterprise Content Management Liferay, Drupal, JBoss Alfresco, Drupal, Magento, CQ Hadoop Ecosystem, MongoDB, ZK, HTML5, MuleSoft JBoss, Moodle, Ephesoft, Liferay Pentaho, Talend, Solr, Jaspersoft ▪ Intranet ▪ Extranet ▪ Big Data Portal ▪ EAI ▪ Mobile Portal ▪ SOA ▪ Custom Portal ▪ WCM ▪ e-Commerce ▪ DM ▪ e-Learning ▪ RM ▪ ERP ▪ CMS ▪ Social Collaboration ▪ Imaging Solutions ▪ DAM Making Data Work ▪ Data Integration ▪ Information Delivery ▪ Data Analysis ▪ Enterprise Search SERVICES UI ▪ Development ▪ Integration ▪ Customization ▪ Migration ▪ Testing ▪ Training ▪ Support (24*7) Enterprise Mobility – Strategy ▪ Mobile UX ▪ App Development ▪ MEAP/MDM Managed Cloud Services – Develop ▪ Deploy ▪ Manage VAR/Annual Product Subscription – Liferay ▪ Alfresco ▪ Cloudera Hadoop ▪ MongoDB Extended Development Center – Center of Excellence
  30. 30. Our Content Practice
      Team Size: 110+; Projects: 115+
      • Alfresco Platinum & Gold Partner
      • Acquia “Ready” partner
      • Magento Silver Solution Partner
      • 2010 – Alfresco Partner of the Year Award
      • Alfresco APAC Partner of the Year, 2010
      • Alfresco North America Partner of the Year, 2010
      • 2008 – Alfresco – Best North American implementation of the year
      • 2007 – Alfresco – WCM implementation
      • Connectors/Accelerators, Frameworks
      • RAPIDO – Content Mgmt. & Publishing
      • OCM – Open Source Contract Management
      • Smart Document Processing Solution
      • Migration Toolkit – Documentum to Alfresco
      • Drupal-MongoDB Connector
      • CQUAD – Drupal Magento Connector
      • Magento Chase PaymentTech Extension
  31. 31. Free Assessment Offer
      • This assessment enables companies to maximize existing IT assets and build a roadmap of the vetted Open Source technology options to: reduce software costs by up to 70%; help generate opportunities for growth and service
      • The FREE Assessment Offer includes: Situation Analysis; Gap Analysis; Recommendations/Roadmap
      For more information, please visit www.cignex.com/freeassessment or contact us at info@cignex.com
  32. 32. Thank You. Making Open Source Work. Contact Us: Americas: americasales@cignex.com; EMEA: emeasales@cignex.com; APJ: asiapacificsales@cignex.com. Learn More: www.cignex.com, facebook.com/CIGNEXTechnologies, youtube.com/cignexglobal, twitter.com/cignex
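
Slide 23 mentions two routing ideas: a date range based sharding policy for SolrCloud, and physical storage isolated per county with the County ID as the key. The Python sketch below illustrates both decisions under stated assumptions. The monthly shard granularity, the shard naming scheme, the county IDs, and the storage paths are all hypothetical; the slides do not give them. In the deployment described, store selection was configured through the Alfresco Content Store Selector rather than written as application code.

```python
from datetime import date
from typing import List

# Hypothetical mapping of county ID -> isolated file-system root.
COUNTY_STORES = {
    "county-01": "/mnt/stores/county-01",
    "county-02": "/mnt/stores/county-02",
}


def shard_for(doc_date: date) -> str:
    """Return a shard name for a document date (monthly buckets assumed)."""
    return f"shard_{doc_date.year:04d}_{doc_date.month:02d}"


def shards_for_range(start: date, end: date) -> List[str]:
    """List the monthly shards that a date-range query has to touch."""
    if start > end:
        raise ValueError("start must not be after end")
    shards = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        shards.append(f"shard_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return shards


def store_for(county_id: str) -> str:
    """Pick the storage root from the county ID metadata key."""
    try:
        return COUNTY_STORES[county_id]
    except KeyError:
        raise KeyError(f"no content store configured for {county_id!r}") from None
```

With monthly buckets, a search over a six-month window touches only six shards regardless of how large the full index grows, which is one plausible reason to shard by date when most queries are bounded like the "6 months data range" search on slide 10.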

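
Slide 22 describes the Data Processing Engine as brokering index updates: it receives and queues real-time content updates from the ECM and pushes them to the SolrCloud index at a later stage. The Python sketch below illustrates that queue-then-flush pattern only. The class name, the batch size, and the `push` callback are hypothetical, and the actual DPE internals are not described in the slides.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class IndexUpdateBroker:
    """Queue content updates and push them to an index in batches (illustrative)."""

    def __init__(self, push: Callable[[List[Dict]], None], batch_size: int = 100):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._push = push  # hypothetical callback that writes a batch to the index
        self._batch_size = batch_size
        self._pending: Deque[Dict] = deque()

    def receive(self, update: Dict) -> None:
        """Accept an update from the repository without touching the index."""
        self._pending.append(update)

    def pending(self) -> int:
        """Number of updates waiting to be pushed."""
        return len(self._pending)

    def flush(self) -> int:
        """Push all queued updates in arrival order; return how many were sent."""
        sent = 0
        while self._pending:
            size = min(self._batch_size, len(self._pending))
            batch = [self._pending.popleft() for _ in range(size)]
            try:
                self._push(batch)
            except Exception:
                # Put the batch back at the front so no update is lost.
                self._pending.extendleft(reversed(batch))
                raise
            sent += len(batch)
        return sent
```

Decoupling `receive` from `flush` keeps ingestion from waiting on the index, at the cost of the index lagging the repository until the next flush. That trade-off matches the con listed on slide 18: index and repository consistency has to be maintained programmatically.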