Capgemini Leap Data Transformation Framework with Cloudera

https://www.capgemini.com/insights-data/data/leap-data-transformation-framework

The complexity of moving existing analytical services onto modern platforms like Cloudera can seem overwhelming. Capgemini’s Leap Data Transformation Framework helps clients by industrializing the entire process of bringing existing BI assets and capabilities to next-generation big data management platforms.
During this webinar, you will learn:
• The key drivers for industrializing your transformation to big data at all stages of the lifecycle – estimation, design, implementation, and testing
• How one of our largest clients reduced the effort of its transition to a modern data architecture by over 30%
• How an end-to-end, fact-based transformation framework can deliver IT rationalization on top of big data architectures

Published in: Software

Capgemini Leap Data Transformation Framework with Cloudera

  1. Is your Big Data journey stalling? Take the Leap with Capgemini and Cloudera: Industrializing your transition to the Modern Data Landscape. © Cloudera, Inc. All rights reserved. Copyright © Capgemini 2016. All Rights Reserved.
  2. Speakers
     • Andrea Capodicasa, Senior Solution Architect, Insights & Data
     • Goutham Belliappa, Big Data practice leader, Insights & Data
     • Alex Gutow, Senior Manager, Product Marketing
  3. Agenda
     • The Case for Change
     • Industrializing the Change
     • Adoption
     • Q&A
  4. Capgemini Insights & Data Global Practice
     • Global reach with over 13,000 professionals across 40+ countries
     • Over 500 Big Data & Data Science professionals, including 100+ Hadoop certified consultants
     • We employ >13,000 information management specialist practitioners, deployed across Capgemini’s global network
     • We were recognised again by Gartner as one of the 4 leading information service providers globally
     • Capgemini Insights & Data Global Practice since 2015, delivering business & IT insights and data services
     • Capgemini has a global reach and local presence in 44 countries and over 100 languages
  5. The case for change
  6. Information Trends: What are we seeing in the marketplace?
     Recent years have brought unprecedented changes to the information landscape. Each of these “disruptors” has its own momentum, and collectively they represent a significant opportunity to improve an organization’s effectiveness. Successful CIOs and leaders consciously take these trends into consideration when planning the evolution of their information architecture.
     • Ms. Agility killed Mr. Waterfall: Empower the business by focusing from the “user down”, not the “system up”. The days of modeling business requirements months or even years in advance, with IT delivering a multi-year plan to roll out a solution that may no longer apply in a fast-changing business environment, are long gone.
     • Cloud Computing: The availability of “finished” business functions within the cloud provides organizations with tremendous opportunities while increasing IT information challenges.
     • Open Source: Open source architecture provides substantial development and complexity cost savings vs. legacy software packages.
     • As a Service: Software as a Service offerings in Big Data, Data Transformation & finished analytics are removing the infrastructure bottlenecks of servers, software and maintenance that obstruct speed to market.
     • Security & Privacy: The proliferation of web-connected IP devices creates a “hyper-evolving” cyber breach potential for organizations; privacy laws create compliance challenges with mobile devices.
     • Business Meta-Data: Traditionally, data dictionaries have been single purpose and technically focused. As data becomes more valuable and the same information is used in multiple ways, the need for business meta-data will become critical.
     • Social Computing: Has resulted in data where segments are loosely connected and correlations are at times non-intuitive, requiring new ways to mine and derive insights.
     • In Memory Analytics: Massive in-memory databases with intensely complex analytics are highly scalable: change anything, anytime, and simultaneously compare the results of multiple scenarios in seconds.
     • “Real” Analytics: Describes the transition from historical or hindsight indicators to insight and foresight indicators and visualizations.
  7. Customers are Looking for a Guide
  8. Cloudera Enterprise: Making Hadoop Fast, Easy, and Secure
     A new kind of data platform
     • One place for unlimited data
     • Unified, multi-framework data access
     Cloudera makes it
     • Fast for business
     • Easy to manage
     • Secure without compromise
     Hybrid deployment flexibility: Public Cloud, Private Cloud, Hybrid Environments
     [Platform diagram; labels: Operations; Data Management; Process, Analyze, Serve; Unified Services; Store; Integrate; Batch; Stream; SQL; Search; Resource Management; Security; Filesystem; Relational; NoSQL; Other; Structured; Unstructured]
  9. The traditional approach to BI & Analytics is a bottleneck in the operational value chain
     Traditional BI & Analytics approach
     • Centralised BI teams too monolithic and divorced from the business operations
     • Insights latency
     • Reporting on the past, limited ability to predict and prescribe what is needed now
     • Each new business question asked = more time required to crunch the right data
     • Heavy duplication in operational data throughout the BI layers & systems
     • Diluted data quality & governance create risks of security breach, compliance issues & risk exposure
     • Significant costs: infrastructure and people
     • Limited ability to scale, either from organic data volume growth or increasing data complexity
  10. The Insights-driven enterprise puts information at the centre and insights “at the point of action”
      Next Generation approach
      • Next-generation data management platform enabling a pervasive, real-time “insights & data fabric” serving operations
      • Standardized & cost-effective data management, allowing high agility on insights and the ability to “ask any questions”
      • Operational applications provide data and integrate insights back in a continuous improvement loop
      • Operations integrate predicted best outcomes to optimise business processes, automatically where possible
      • Ability to detect and catch events on the fly that will require immediate action (e.g. fraud detection) for optimal reaction or proactive action
      • Coherent management of platforms & data management processes, with insights & data science skills embedded directly in the operational units for maximum impact
      • Optimized total cost of ownership (TCO) with a rationalized and simplified data landscape
  11. Key challenges blur the vision on both the target and the journey to the Insights-driven enterprise
      Challenges addressed
      • “Which data should we retain and/or which data could we archive?”
      • “I don’t know how to drive value from my data”
      • “Can I decrease costs by moving my data (landscape) to the cloud or As-A-Service?”
      • “How mature is my data landscape in comparison to the best industrial trends?”
      • “I have been told to ‘do something’ about big data analytics but don’t know where to start”
      • “Can the Business Intelligence landscape be optimized to derive the maximum value out of it?”
      • “Our data landscape is scattered, complex and very expensive, can we fix it?”
      Value created: a modern data strategy will enable
      • Reduced complexity: Rationalizing the data strategy to meet demand
      • Lower cost: Reduce the operating cost of your data strategy
      • Increased agility and better time to market: More speed in the development of new information applications
      • More/better insights and return on intelligence: Ease of deriving meaningful insights and enabling business transformation
      • Less risk: Reduce complexity of the data strategy
      • Data security & privacy: Make your data strategy compliant with rules and regulations
      [Platform diagram; labels: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]
  12. Industrializing the change
  13. Capgemini’s Leap Data Transformation Framework: Modules overview
      Modules: Misura, Diligent, Idem, Blend, Papillon, Virtu, Essence
      Lifecycle phases: Estimation, Discovery, Design/Build, Testing
      Module functions shown on the slide:
      • Estimate the transformation effort
      • Optimize/reduce transformation scope
      • Optimize ETL semantic design
      • Optimize reporting design
      • Optimize SQL
      • Industrialize end to end testing
      Essence (Semantic Layer consolidation)
      • Analyze existing semantic layer of architecture
      • Identify potential functional overlap and produce recommendations for consolidation
      Data concierge
      • Business Information Catalog
      • Self service ingestion, distillation, analytics
      Data Operations Services
      • Agile environment provisioning
      • Continuous Integration lifecycle
      • One-Click leap
  14. Diligent / Blend Applications
      Business Problem
      • Large and complex DW estates have been built over the last 20 years or so, and the infrastructure hosting them may need updating
      • A number of reports and underlying tables are duplicated or no longer used; they can be decommissioned, saving valuable resources
      • Users are reluctant to give up “their” reports/data when migration programmes occur
      Solution
      • Scope reduction by identifying current BO reports that are not used (up to 40% discovered with one of our customers)
      • Scope reduction by identifying reports that are duplicates or share a number of data items
      • Automated method to migrate BO reports to Pentaho, hence reduced workload and fewer errors
      • A scientific and objective approach to measuring which data are actually used
      Accelerators
      • Diligent BO Audit data explorer, to identify interactions between users and Universes / Reports and tables
      • Diligent BO Meta data gathering module, to extract Universe and report information
      • Blend Report merger, to identify report reductions
      • Blend XML Generator, to create Pentaho reporting cubes from the metadata gathered by Diligent
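The two scope-reduction ideas on the Diligent / Blend slide (reports that nobody uses, and reports that share most of their data items) can be illustrated with a short Python sketch. The slides do not describe how Diligent or Blend work internally, so everything here is an assumption made for illustration: the audit-event format, the 180-day look-back window, the Jaccard overlap measure, the 0.8 threshold, and all report names.

```python
from datetime import date, timedelta
from itertools import combinations


def unused_reports(all_reports, audit_events, today, window_days=180):
    """Return reports with no audited access inside the look-back window."""
    cutoff = today - timedelta(days=window_days)
    recently_used = {name for name, accessed_on in audit_events if accessed_on >= cutoff}
    return sorted(set(all_reports) - recently_used)


def overlapping_reports(report_items, threshold=0.8):
    """Return report pairs whose data items overlap by at least `threshold` (Jaccard index)."""
    pairs = []
    for a, b in combinations(sorted(report_items), 2):
        items_a, items_b = report_items[a], report_items[b]
        union = items_a | items_b
        if not union:
            continue  # two empty reports carry no evidence of duplication
        score = len(items_a & items_b) / len(union)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical report inventory: report name -> data items it displays.
report_items = {
    "sales_monthly": {"region", "month", "revenue", "units"},
    "sales_monthly_v2": {"region", "month", "revenue", "units", "margin"},
    "hr_headcount": {"department", "headcount"},
}
# Hypothetical audit trail: (report name, date of access).
audit_events = [
    ("sales_monthly", date(2016, 5, 2)),
    ("hr_headcount", date(2015, 1, 15)),
]

decommission_candidates = unused_reports(report_items, audit_events, today=date(2016, 6, 1))
merge_candidates = overlapping_reports(report_items)
print(decommission_candidates)
print(merge_candidates)
```

Both lists are only candidates: as the slide notes, users are reluctant to give up "their" reports, so the output would feed a review rather than an automatic deletion.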
  15. IDEM-DA
      Business Problem
      • The customer has very strict security and normalisation requirements when loading data: different obfuscation types are needed for different “semantic types”, e.g. names, phone numbers, social security numbers, etc.
      • Left as a manual activity, this would mean a laborious and time-consuming identification of hundreds of thousands of columns, a costly and error-prone activity
      Solution
      • Automated identification of table columns for encryption and standardisation
      • Automated creation of ETL meta-data spreadsheets which drive the Data Acquisition Pentaho jobs for data migration
      Accelerator Results
      • Generation of the meta-data spreadsheet: several days to weeks manually; 15 minutes to 2 hours with IDEM-DA
      • Identification of data types: manual eyeballing of data is prone to human error and can take hours to several days; IDEM-DA gives approximately 70% reduction and more accurate identification of known types
      • Project manager of the Data Migration project: “IDEM-DA is the only way forward”
  16. IDEM-DA: example table
      IDEM-DA is a module used to support the ETL from legacy data warehouses into modern architecture

      Column Name     | Dataset                                        | Semantic Type
      mob_no          | 07710232931, 07083210302                       | MOBILE_NO
      email           | example@hotmail.com, hello@gmail.com           | EMAIL
      free_text_field | My address is 12 lucky street, London, E12 2TF | Address
      serial_id       | 11234, 22313, 3231313                          | UNKNOWN
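To make the semantic typing in the IDEM-DA example table concrete, here is a minimal Python sketch that assigns a semantic type to a column from a sample of its values. It is not IDEM-DA's method, which the slides do not describe: the regular expressions (UK-style mobile numbers and postcodes), the 90% match share and the function name `semantic_type` are all assumptions chosen so that the slide's four sample columns can be classified.

```python
import re

# Hypothetical detection rules, checked in order. Real rules would need
# locale-specific patterns and far more semantic types than these three.
PATTERNS = [
    ("EMAIL", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("MOBILE_NO", re.compile(r"^07\d{9}$")),  # assumed UK mobile format
    # Free text counts as an address when it contains a UK-style postcode.
    ("Address", re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}\b")),
]


def semantic_type(values, min_share=0.9):
    """Return the first type whose pattern matches at least `min_share` of the values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "UNKNOWN"
    for name, pattern in PATTERNS:
        hits = sum(1 for v in cleaned if pattern.search(v))
        if hits / len(cleaned) >= min_share:
            return name
    return "UNKNOWN"


# Sample values copied from the example table on the slide.
columns = {
    "mob_no": ["07710232931", "07083210302"],
    "email": ["example@hotmail.com", "hello@gmail.com"],
    "free_text_field": ["My address is 12 lucky street, London, E12 2TF"],
    "serial_id": ["11234", "22313", "3231313"],
}

detected = {name: semantic_type(vals) for name, vals in columns.items()}
for name, kind in detected.items():
    print(f"{name}: {kind}")
```

A detected type could then select the obfuscation rule for the column; columns that come back as UNKNOWN are the ones left for manual review.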
  17. IDEM-ES
      Business Problem
      • The customer has a load pattern called “cutover+delta”: historical tables are updated with daily files
      • Although most columns in many tables have similar names, matching them manually would mean a time-consuming and error-prone identification of hundreds of thousands of columns
      Solution
      • Machine learning based solution to automatically identify similarity between columns (human-supervised), using:
        • Column name similarity (n-grams)
        • Column content similarity (n-grams)
        • Column content agnostic distribution (histogram)
      • Open architecture to automatically evaluate the best model (600+ tested)
      • Automated creation of INSERT INTO ETL scripts
      Accelerator Results
      • Acceleration expected around 30-50%
      • Can automatically generate SQL INSERT statements to create the current view
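The column-name similarity and script generation mentioned on the IDEM-ES slide can be sketched as follows. This is a simplified stand-in, not the IDEM-ES implementation: it uses only character trigrams of column names with a Jaccard score (the slide also lists content similarity, histograms and a model search, none of which appear here), and the 0.3 threshold and the table and column names are hypothetical.

```python
def ngrams(text, n=3):
    """Character n-grams of a column name, padded so prefixes and suffixes count."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two column names."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def match_columns(source_cols, target_cols, threshold=0.3):
    """Map each target column to its most similar source column, if similar enough."""
    mapping = {}
    if not source_cols:
        return mapping
    for target in target_cols:
        best = max(source_cols, key=lambda source: similarity(source, target))
        if similarity(best, target) >= threshold:
            mapping[target] = best
    return mapping


def insert_statement(target_table, source_table, mapping):
    """Build an INSERT INTO ... SELECT statement from a reviewed column mapping."""
    targets = ", ".join(mapping)
    sources = ", ".join(mapping.values())
    return f"INSERT INTO {target_table} ({targets}) SELECT {sources} FROM {source_table}"


# Hypothetical historical (target) and daily delta (source) column names.
history_cols = ["customer_id", "mobile_no", "email_address"]
delta_cols = ["cust_id", "mob_no", "email_addr"]

mapping = match_columns(delta_cols, history_cols)
statement = insert_statement("customer_hist", "customer_delta", mapping)
print(mapping)
print(statement)
```

Because name similarity alone can produce wrong matches, a mapping like this would be shown to a person for confirmation before the statement is run, in line with the "human-supervised" note on the slide.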
  18. IDEM-ES
  19. IDEM-ES
  20. Virtu: Data testing Framework
      Business Problem
      • Testing data migrations, and more generally the integrity of data transformations in large scale BI/DW estates, is complicated
      • Thousands of objects are moved across during the migration and, once in production, loaded every day; this can lead to hundreds of defects, and keeping track of them all without an automated system becomes a daunting task
      • Continuous monitoring of the DQ performance and regression error history is essential to maintain acceptable levels of quality
      Solution
      A complete e2e testing framework that accelerates the configuration, execution and evaluation of tests for large scale BI domains, comprising:
      • Web UI for maximum user friendliness in configuration
      • Scheduler engine to launch configurable batches of tests
      • Real-time Defect manager for timely defect issuing and progress checks
      • DQ dashboard for monitoring state and progress
      Benefits
      • Customer can easily plan and execute a large number of checks, completely controlling their lifecycle (creation, modification, decommissioning)
      • Configurable engine to store details of defects, for maximum visibility and transparency on errors and their resolutions
      • Native connection to modern defect management systems (Jira), easily extended to any system with a reachable API
      • DQ dashboard gives real-time and drillable information on the current DQ state
      • Compatible with 3 system types: Oracle, Impala & MySQL
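The kind of check a data testing framework like Virtu schedules can be pictured with a small reconciliation example in Python. This is a sketch under stated assumptions, not Virtu's design: the slide names Oracle, Impala and MySQL as supported systems, whereas this example uses an in-memory SQLite database purely so that it runs anywhere, and the table names, the check list and the `run_check` helper are hypothetical.

```python
import sqlite3


def run_check(conn, name, source_sql, target_sql):
    """Run one reconciliation check: both queries must return the same single value."""
    source_value = conn.execute(source_sql).fetchone()[0]
    target_value = conn.execute(target_sql).fetchone()[0]
    return {
        "check": name,
        "source": source_value,
        "target": target_value,
        "passed": source_value == target_value,
    }


# Stand-in for a legacy table and its migrated copy; one row is missing on purpose.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_orders (id INTEGER, amount INTEGER);
    CREATE TABLE migrated_orders (id INTEGER, amount INTEGER);
    INSERT INTO legacy_orders VALUES (1, 100), (2, 250), (3, 75);
    INSERT INTO migrated_orders VALUES (1, 100), (2, 250);
""")

# A batch of checks, each a (name, query on source, query on target) triple.
checks = [
    ("row_count", "SELECT COUNT(*) FROM legacy_orders", "SELECT COUNT(*) FROM migrated_orders"),
    ("min_id", "SELECT MIN(id) FROM legacy_orders", "SELECT MIN(id) FROM migrated_orders"),
]

results = [run_check(conn, *check) for check in checks]
defects = [result for result in results if not result["passed"]]
for defect in defects:
    print(f"DEFECT {defect['check']}: source={defect['source']} target={defect['target']}")
```

Keeping checks as data rather than code is what would let a scheduler run them in batches and a defect manager record each failure, which is the lifecycle the slide describes.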
  21. Virtu: Testing Framework
  22. Virtu: Testing Framework
  23. Adoption
  24. Leap Data Transformation Framework is the result of a client co-innovation process and delivered efficiencies on large projects
      An end to end, fact-based transformation framework to deliver IT Rationalization on top of Big Data architectures
      Context
      • A Capgemini client in the Public Sector is building a Business Data Lake (BDL) to support all digital channel interactions as well as rationalize/optimize its IT Business Intelligence legacy landscape on top of the new Big Data architecture
      • In the scope of the IT Rationalization project, 10+ data warehouses, hundreds of analytical business services, and thousands of BO reports must be moved on top of the BDL, for thousands of business users throughout the organization
      • In this context, Leap Data Transformation Framework was used on a 1st business scope
      Approach
      Leap is a framework consisting of a transformation methodology and accelerators across the transformation lifecycle which can operate at scale:
      • The methodology is modular and covers all phases of transformations
      • Elements of the Discovery phase were automated
      • Design and Build process automation (metadata driven) and application deployment controls delivered development efficiencies and scalability
      • A metadata driven test automation framework reduced initial test effort and subsequent regression test activities
      • A Continuous Development process
      • Platform application stack deployment efficiencies
      Key Outcomes / Accelerator Results
      • Diligent: 40% reduction of the transformation scope
      • Idem, Papillon, Blend: 15% efficiency in the design/build process through use of a semi-automated ETL code optimizer, a semi-automated SQL optimizer and a semi-automated report optimizer
      • Virtu: 10% efficiency in the test development process (1st pass) & 30% efficiency in regression testing through an automated test & assurance framework
  25. Use cases for Capgemini’s Leap Data Transformation Framework for optimized business data lakes
      Replatforming
      • For advanced clients embracing the potential of modern architectures
      • Opportunity to transform, simplify and rationalize an organization’s data landscape for optimized TCO
      • Leap Data Transformation full suite enables risk and cost reduction, working well in an agile approach
      Legacy Discovery/DW optimization
      • For clients in need of better visibility of their current data assets before moving to Big Data
      • Leap Data Transformation Framework can help optimize current data management processes, substantially reduce transformation scope, identify the optimal platform for the workloads and shape a future project for success
      Managing existing BI & move to modern architectures
      • Capgemini takes over the current BI estate and modernizes it through its NextGen BISC approach
      • For clients with redundant and expensive DW estates concerned about the risks of moving to modern architectures
      • Leap Data Transformation Framework full suite is a key element in optimizing the TCO and ensuring quality in the transformation process
      Testing
      • For clients needing to automate their data testing in big data environments or large relational environments
      • Tools can automate the testing lifecycle for both big data and traditional relational DW estates
  26. Replatforming legacy BI applications requires strong strategies for user adoption and decommissioning
      THE USER ADOPTION STRATEGY
      Strong user adoption strategy
      • End users understand the new value they will get out of the new system
      • They are empowered to use it
      • Their success is spreading to new initiatives
      • They forget all about the old & slow stuff fairly quickly
      Weak user adoption strategy
      • End users fear the new system will impact their capacity to do their jobs
      • The known is safer than the new
      • First tests on the new systems disappoint, any failure goes viral
      • Evolutions still run on the old system, “just in case”
      THE KILL STRATEGY
      Strong kill strategy
      • Systems are killed according to roadmap, costs linked to unused HW & SW are recovered
      • IT & Business impacts are anticipated, managed and communicated
      • The energy is focused on the new
      Weak kill strategy
      • First systems are shut down ignoring business constraints, impacting operations
      • Endless hours spent comparing the old and the new and explaining differences
      • Unprepared board escalations when unplanned impacts arise
  27. Sample table of contents for the output of a 4 week Data Warehouse Optimization roadmap based on LEAP
      Contents
      • Our Understanding
      • Big Data Trends in Heavy Equipment / Farm Industry
      • Technology Principles
      • Reference Architecture: Conceptual Architecture; Architecture Components
      • Technology Choice Points: ETL tool comparison; EMR vs. Hadoop
      • ETL & Data Offloading Plan: Project Structure, Sequence, Sprints; Assumptions; Collaborative Planning & Prep
      • Logical Architecture
      • Business Value Proposition
      • Current State Architecture
      • End State Architecture
      • Current State + 6 months Architecture
      • Current State + 12 months Architecture
      • Current State + 18 months Architecture
      Architecture layers shown: Data Extract & Staging; Data Management & EDW; Semantic Layer; Sandbox & Analytics; Operational Analytics; Data Virtualization Layer; Master Data Management; Metadata Management; Data Distribution Layer
  28. What’s next?
  29. Contact our experts
      • Schedule a discovery session with our experts
      • Schedule a first assessment of the value of Leap for your organization
      Goutham Belliappa, Goutham.belliappa@capgemini.com, https://www.linkedin.com/in/gouthambelliappa
      Andrea Capodicasa, Andrea.capodicasa@capgemini.com
      Duane Garrett, duane@cloudera.com
