Managing a Billion Object Repository

Social security benefit systems are government programs intended to promote the welfare of the population through assistance measures that guarantee access to sufficient resources for food and shelter and that support health and well-being. As populations grow, government and non-government organizations manage a wealth of data covering areas such as medical diagnostics, infrastructure management, personalized benefits and emergency services. Content is being produced at an ever-increasing rate, in high volume and at high velocity. Industry leaders are looking for a proven, reliable enterprise solution that can serve as a cornerstone for the future. There is a need for an innovative repository solution that can easily store, retrieve, organize and manage various types of content and records, that scales, and that enables faster processing with ease of use.

Published in: Technology
  • Join Alfresco and its Platinum Partner CIGNEX Datamatics to learn how Alfresco has aided in building the largest repository solution for a welfare organization in the United States.
  • At the bottom level of the architecture, a core enterprise metadata framework provides a small set of metadata elements that apply to the majority of enterprise information assets under management. The second layer of the Big Content metadata architecture consists of domain-specific elements that are not necessarily applicable to all enterprise content but are useful to a particular area such as a brand, product or department. The top layer of the metadata architecture consists of application-specific metadata. This is additional information about content that is relevant only to the use case at hand and the application facilitating its execution.
  • Search operates at two levels in a Big Content environment: Discovery and Analysis. At the discovery level, search functions much as it does on the web or in traditional enterprise search. It provides a single, comprehensive index of available information assets against which queries are matched and relevant content is retrieved. Beyond simple information retrieval, this sort of discovery facilitates the deeper analysis at the heart of Big Content. Fuzzy matching: search is also good at finding things that are “close enough” by handling spelling variants, synonyms, related content and other fuzzy matching mechanisms. This is extremely useful when attempting to uncover nuggets of information scattered across and hidden within large amounts of heterogeneous content.
  • Managing a Billion Object Repository

    1. Managing a Billion Object Repository. November 13, 2013. Munwar Shariff, CIGNEX Datamatics. #SummitNow
    2. About the Speaker • Co-Founder & Chief Technology Officer of CIGNEX Datamatics • 23+ years of Industry Experience • Author of the First Alfresco Book (2006) • Certified Alfresco Trainer • Author of Five Technical Books
    3. Agenda • Use Case: Social Security e-Benefits System • The need for “Big Content” • Solutions Evaluated • Alfresco as Big Content Platform • Summary
    4. Use Case: Social Security e-Benefits System
    5. Program Coverage • Employment Services • Cash Assistance • Insurance • My Benefits • Food Stamps • Childcare • Healthcare • Housing
    6. Program Objectives • Scalable Centralized Document Repository • One-time migration of existing docs (~10 yrs of archives) • Secure access • Metadata Management • High Performance Search and Retrieval • Correspondence Templates (Versioned)
    7. Scalability Requirements • ~500 Million objects, growing to a Billion • ~60 Million objects added per year • Estimated repository size = 60 TB • Administrative Users = 30,000 • Document Ingestion rate = 200,000/hour • Search (6 months date range) = 500/2 sec • PCL to PDF conversion = 25,000/day
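As a rough check on the targets above, the hourly and daily figures can be converted to sustained rates, and the yearly growth to a time horizon. This is back-of-envelope arithmetic on the slide's numbers only; the function names are illustrative, and the results assume constant rates and (for conversions) an even spread over 24 hours, neither of which the slides state.

```python
# Figures copied from the Scalability Requirements slide.
INGEST_PER_HOUR = 200_000
OBJECTS_NOW = 500_000_000
OBJECTS_TARGET = 1_000_000_000
OBJECTS_ADDED_PER_YEAR = 60_000_000
PCL_TO_PDF_PER_DAY = 25_000


def ingest_per_second() -> float:
    """Sustained ingestion rate, in documents per second, implied by the hourly target."""
    return INGEST_PER_HOUR / 3600


def years_to_target() -> float:
    """Years to grow from the current object count to the target (assumes constant growth)."""
    return (OBJECTS_TARGET - OBJECTS_NOW) / OBJECTS_ADDED_PER_YEAR


def conversions_per_hour() -> float:
    """PCL-to-PDF conversions per hour (assumes an even spread over 24 hours)."""
    return PCL_TO_PDF_PER_DAY / 24


if __name__ == "__main__":
    print(f"ingestion: {ingest_per_second():.1f} docs/second")
    print(f"years to one billion objects: {years_to_target():.1f}")
    print(f"PCL to PDF: {conversions_per_hour():.0f} conversions/hour")
```

Under these assumptions the ingestion target works out to roughly 56 documents per second, and the repository would pass a billion objects a little over eight years after reaching 500 million.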
    8. The need for Big Content (Unstructured Big Data)
    9. Types of Big Content • Social Media Postings • Audio & Video Files • Web Logs, Emails • Records & Documents • Blogs & Comments Source: Gartner, 17 Oct 2012
    10. Big Content Needs More Metadata (layers, top to bottom) • Application Metadata • Domain Specific Metadata (Brand, Product, Department) • Core Enterprise Metadata Framework (Elements Applicable to All Enterprise Content) Source: Gartner, 15 May 2013
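One way to picture the three metadata layers is as dictionaries that are merged when a document's metadata is read. The sketch below is illustrative only: the field names are invented, and the rule that a more specific layer overrides a more general one on a key clash is an assumption, not something the slides state.

```python
from collections import ChainMap


def effective_metadata(core: dict, domain: dict, application: dict) -> dict:
    """Merge the three layers into one view.

    ChainMap looks keys up left to right, so application values win over
    domain values, which win over core values (an assumed precedence).
    """
    return dict(ChainMap(application, domain, core))


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    core = {"title": "Benefit notice", "created": "2013-11-13", "owner": "records"}
    domain = {"department": "Healthcare", "owner": "healthcare-team"}
    application = {"template_version": "3"}
    print(effective_metadata(core, domain, application))
```

Keeping the layers as separate inputs means the core set can stay small and shared, while domain and application fields are added only where they apply.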
    11. Enterprise Search is the key • Search provides a ready entry into Big Content • Data-centric vendors acquired search companies • Alfresco => Apache Solr (“SolrCloud” in future?) Source: Gartner, 13 May 2013
    12. Big Content Discovery & Analysis (diagram labels: Search Engine, Users, Discovery Level, Analysis Level, Fuzzy Matching Mechanism, Indexing)
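The fuzzy matching mechanism named on this slide can be illustrated with a small lookup that tolerates spelling variants and expands synonyms. This is a minimal sketch using Python's standard `difflib` module, not the matching a search engine such as Solr performs; the vocabulary, the synonym table and the `fuzzy_lookup` name are all assumptions made for the example.

```python
import difflib
from typing import Dict, List, Sequence

# Hypothetical index vocabulary and synonym table.
VOCABULARY = ["benefits", "childcare", "healthcare", "housing", "insurance"]
SYNONYMS = {"medical": ["healthcare"]}


def fuzzy_lookup(
    term: str,
    vocabulary: Sequence[str] = VOCABULARY,
    synonyms: Dict[str, List[str]] = SYNONYMS,
    cutoff: float = 0.8,
) -> List[str]:
    """Return vocabulary terms that are 'close enough' to the query term."""
    term = term.lower()
    # Spelling variants: similarity ratio of at least `cutoff`, best matches first.
    matches = difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)
    # Synonyms: add mapped terms that were not already matched.
    for synonym in synonyms.get(term, []):
        if synonym not in matches:
            matches.append(synonym)
    return matches


if __name__ == "__main__":
    for query in ("benifits", "helthcare", "medical"):
        print(query, "->", fuzzy_lookup(query))
```

The cutoff trades recall for precision: lowering it finds more distant variants but also more unrelated terms.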
    13. Solutions Evaluated
    14. Technical Requirements 1. Scalable Repository, High Ingestion Rate 2. High Performance Search and Retrieval 3. Secure access at “county level” (group) 4. Compliance on storage (physical separation) 5. Version Control, Workflow & Business Rules 6. Web Services API for external access
    15. Option 1: MongoDB + Solr + Liferay. Pros: • Highly Scalable • High performance • API based access. Cons: • Secure (Group) Access requires heavy customization • Not a traditional ECM install & configuration • Content services missing, such as versioning, workflow, business rules
    16. Option 2: Lily = Hadoop HBase + Solr. Pros: • Highly Scalable • API based access • A few content services, such as versioning • Separation of storage. Cons: • Queuing system is not robust • Performance issues • Secure (Group) Access requires heavy customization
    17. Option 3: Alfresco + SolrCloud + DPE. Pros: • Highly Scalable • High performance • Secure Access • Separation of storage • Content services • API based access. Cons: • Need to programmatically maintain index/repository consistency • Custom “Data Processing Engine” requires support
    18. Alfresco as Big Content Platform: go big or go home…
    19. Architecture (diagram labels: Data Processing Engine, Workload Scheduler, Legacy System, 15,000+ Docs/Day, Solr Search, 200,000+ Ingestion Rate, Secure & Flexible, Ingestion/night, 25/second, Content Repository, Various Documents)
    20. Software • Operating System: Ubuntu Server 12.04.2 • ECM: Alfresco EE version 4.1.4 • Database: Oracle RAC 11g 11.2.0.3 • File Storage: Veritas Cluster File System • Search: SolrCloud (Apache Solr version 4.3.1 and Apache ZooKeeper 3.4.5) • Application Server: Node.js (event-driven, non-blocking I/O model for data-intensive real-time applications that run across distributed devices) • PCL to PDF converter: PageTech • ESB: Oracle Service Bus
    21. Data Processing Engine (DPE) • Central controller/broker • Document ingestion into Alfresco, including preprocessing, splitting and metadata extraction • Brokering index updates: receiving and queuing real-time content updates from the ECM, which are pushed at a later stage to the SolrCloud index
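The index-brokering role described above (receive updates, queue them, push them to the index later) can be sketched as a small in-memory queue that flushes in batches. This is a hedged illustration in Python, not the DPE's actual code, which the slides place on Node.js; the `IndexUpdateBroker` class and the `send_batch` callable are hypothetical stand-ins for whatever posts documents to SolrCloud.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Update = Dict[str, str]


class IndexUpdateBroker:
    """Queues content updates and pushes them to a search index later, in batches."""

    def __init__(self, send_batch: Callable[[List[Update]], None]) -> None:
        # send_batch is a hypothetical hook for the real index client.
        self._send_batch = send_batch
        self._queue: Deque[Update] = deque()

    def enqueue(self, update: Update) -> None:
        """Record an update received from the repository."""
        self._queue.append(update)

    def pending(self) -> int:
        """Number of updates not yet pushed to the index."""
        return len(self._queue)

    def flush(self, batch_size: int = 1000) -> int:
        """Send all queued updates in arrival order; return the number of batches sent."""
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        batches = 0
        while self._queue:
            size = min(batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            # A failed send loses this batch; a production broker would need retries.
            self._send_batch(batch)
            batches += 1
        return batches


if __name__ == "__main__":
    broker = IndexUpdateBroker(lambda batch: print(f"sending {len(batch)} updates"))
    for i in range(5):
        broker.enqueue({"id": str(i), "county_id": "county-001"})
    broker.flush(batch_size=2)
```

Batching amortizes the cost of each index request, at the price of the index lagging the repository until the next flush.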
    22. DPE Highlights • Asynchronous I/O enabling high data ingestion/export throughput • Flexibility of consistency models: eventual consistency for batch operations, transactional consistency for online operations • Distributed processing model: the DPE can scale horizontally by distributing processing across multiple nodes, with coordination handled using a message/event bus • Extensible synchronization: synchronization can be extended to multiple indexing engines that support additional operations such as statistical and analytical queries, semantic search (RDF/SPARQL), or graph traversals
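The two consistency models on this slide differ in when the index catches up with the repository. The sketch below contrasts them with in-memory dictionaries standing in for the repository and the index; the `DualStore` class and its method names are invented for illustration and do not describe how the DPE is implemented.

```python
from typing import Callable, Dict, List

_MISSING = object()

IndexWriter = Callable[[Dict[str, str], str, str], None]


def _default_index_writer(index: Dict[str, str], doc_id: str, body: str) -> None:
    index[doc_id] = body


class DualStore:
    """Toy repository plus index, with a batch path and an online path."""

    def __init__(self, index_writer: IndexWriter = _default_index_writer) -> None:
        self.repository: Dict[str, str] = {}
        self.index: Dict[str, str] = {}
        self._pending: List[str] = []
        self._index_writer = index_writer

    def write_batch(self, doc_id: str, body: str) -> None:
        """Eventual consistency: store now, index on the next sync."""
        self.repository[doc_id] = body
        self._pending.append(doc_id)

    def sync_index(self) -> int:
        """Index everything still pending; return the number of documents indexed."""
        synced = 0
        while self._pending:
            doc_id = self._pending.pop(0)
            self._index_writer(self.index, doc_id, self.repository[doc_id])
            synced += 1
        return synced

    def write_online(self, doc_id: str, body: str) -> None:
        """Transactional: repository and index both change, or neither does."""
        previous = self.repository.get(doc_id, _MISSING)
        self.repository[doc_id] = body
        try:
            self._index_writer(self.index, doc_id, body)
        except Exception:
            # Undo the repository write so the two stores stay in step.
            if previous is _MISSING:
                del self.repository[doc_id]
            else:
                self.repository[doc_id] = previous
            raise


if __name__ == "__main__":
    store = DualStore()
    store.write_batch("doc-1", "nightly import")
    print("indexed before sync:", "doc-1" in store.index)
    store.sync_index()
    print("indexed after sync:", "doc-1" in store.index)
```

The batch path favors throughput and accepts a window in which a stored document is not yet searchable; the online path pays for an immediate index write so that a user sees their change at once.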
    23. Custom SolrCloud Integration • Highly scalable (production use cases of 3+ billion documents on such a setup) • A date-range-based sharding policy can be implemented • Multiple Alfresco repositories can use the same SolrCloud instance
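A date-range sharding policy maps each document's date to a shard, so that a date-bounded search only has to touch the shards that overlap its range. The sketch below assumes half-year shards and a `YYYY-Hn` naming scheme; both are illustrative choices, not details given in the slides, and the code does not call any Solr API.

```python
from datetime import date
from typing import List


def _half(d: date) -> int:
    """1 for January to June, 2 for July to December."""
    return 1 if d.month <= 6 else 2


def shard_for(d: date) -> str:
    """Name of the (hypothetical) half-year shard that holds documents dated d."""
    return f"{d.year}-H{_half(d)}"


def shards_for_range(start: date, end: date) -> List[str]:
    """Shards a search over [start, end] must query, in chronological order."""
    if end < start:
        raise ValueError("end must not be before start")
    shards = []
    year, half = start.year, _half(start)
    last = (end.year, _half(end))
    while (year, half) <= last:
        shards.append(f"{year}-H{half}")
        if half == 1:
            half = 2
        else:
            year, half = year + 1, 1
    return shards


if __name__ == "__main__":
    print(shard_for(date(2013, 11, 13)))
    print(shards_for_range(date(2013, 5, 1), date(2014, 2, 1)))
```

With shards of this size, a search limited to a six-month window touches at most two shards regardless of how many years of documents the index holds.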
    24. Veritas Cluster File System • Performance Scaling: cluster with multi-core processors, large memories, and multiple high-performance gigabit Ethernet interfaces for client access • File System Scaling: supports individual file systems of up to 256 terabytes capacity and up to a billion files per file system, with no practical limit on the number of file systems hosted by a cluster
    25. Physical Separation of Files • Physical storage (file system) is isolated per county, as per compliance requirements • An “Alfresco Content Store Selector” is configured for each county • County ID (key) is the metadata
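The routing idea behind per-county separation is a lookup from the County ID metadata value to a dedicated storage root. The sketch below shows only that idea; it does not reproduce Alfresco's Content Store Selector configuration, and the county IDs, mount points and function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict

# Hypothetical county IDs and mount points; real values would come from deployment config.
COUNTY_STORES: Dict[str, PurePosixPath] = {
    "county-001": PurePosixPath("/mnt/contentstore/county-001"),
    "county-002": PurePosixPath("/mnt/contentstore/county-002"),
}


def store_root_for(county_id: str, stores: Dict[str, PurePosixPath] = COUNTY_STORES) -> PurePosixPath:
    """Return the storage root for a county; fail rather than fall back to a shared store."""
    try:
        return stores[county_id]
    except KeyError:
        raise KeyError(f"no content store configured for county {county_id!r}") from None


def check_isolation(stores: Dict[str, PurePosixPath]) -> None:
    """Raise ValueError if two counties share a root or one root is nested inside another."""
    roots = list(stores.items())
    for i, (county_a, root_a) in enumerate(roots):
        for county_b, root_b in roots[i + 1:]:
            if root_a == root_b or root_a in root_b.parents or root_b in root_a.parents:
                raise ValueError(f"stores for {county_a} and {county_b} overlap")


if __name__ == "__main__":
    check_isolation(COUNTY_STORES)
    print(store_root_for("county-001"))
```

Refusing unknown county IDs, instead of defaulting to a common store, keeps a misconfigured document from landing on another county's file system.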
    26. Deployment
    27. Hardware • Number of Servers = 9: 2 Alfresco, 4 Solr (2 Solr, 2 ZooKeeper), 2 Data Processing Engine, 1 PDF Converter • 120 GB RAM • 72 CPU Cores • 16 TB File System Storage per annum
    28. Solution Benefits • The solution architecture is designed for horizontal scalability, to scale really big considering future requirements • The proposed design supports “Performance SLAs”, considering the load and the number of people who would access the system • We have taken a “modular” approach in our design, so that components can be replaced in future if there is a need to do so
    29. Conclusion: Big Content = Platform
    30. About CIGNEX Datamatics
    31. Where we help our customers…