
CWIN17 India / Big Data Architecture – Yashowardhan Sowale


  1. Big Data Architecture Overview
  2. Gartner Hype Cycle – Emerging Technologies
  3. Benefits
  4. Big Data and its Dimensions. Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible:
      • Volume: scale from terabytes to petabytes (1K TB) to zettabytes (1B TB)
      • Variety: manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
      • Velocity: streaming data and large-volume data movement
      • Veracity: having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect; organizations need to ensure both that the data is correct and that the analyses performed on it are correct
      • Value: discovering value from multichannel datasets
  5. Applications for Big Data Analytics: homeland security, finance, smarter healthcare, multi-channel sales, telecom, manufacturing, traffic control, trading analytics, fraud and risk, log analysis, search quality, retail churn
  6. General reference architecture for Big Data Analytics. The architecture flows from source data through Process, Information, Analyze and Insight to Act and Value, with a Manage layer spanning all stages:
      • Source data: internal data (IT-managed applications such as ERP, SCM, CRM; master and reference data; business-owned informal data; documents, mail, images, voice, video; web and mobile apps; B2B) and external data (Internet, social, Internet of Things (machine, sensor); third-party data such as market, weather, climate, geolocation; open data)
      • Process: ingest, catalog, stream, store, prepare, refine/blend, mask, manage lifecycle
      • Information: data asset descriptions; processed data (measures, KPIs; dimensions, master data); granular data (events, context information)
      • Analyze: search (what is relevant?), explorative (how does it work?), descriptive (what happened?), diagnostic (why did it happen?), predictive (what will happen?), prescriptive (how to act next?)
      • Insight: customer experience, operational process optimization, risk and fraud, disruptive business models
      • Act: business applications (customer campaigns, trigger activity), business processes (trigger event, adjust process), decision makers (approve/reject business opportunities, develop new business models and products)
      • Value: customer profitability, operational cost cutting, risk prevention, market share increase, business performance, performance improvement
      • Manage: data governance and security, data privacy, compliance, collaboration, value generation, program delivery, data-driven culture, information strategy, skill development, master data management, metadata management, data quality management, operations and SLAs, orchestration
  7. The BDL is also aligned with our principles:
      1. Unleash Data and Insights as-a-Service
      2. Make Insight-driven Value a Crucial Business KPI
      3. Empower your People with Insights at the Point of Action
      4. Develop an Enterprise Data Science Culture
      5. Master Governance, Security and Privacy of your Data Assets
      6. Enable your Data Landscape for the Flood coming from Connected People and Things
      7. Embark on the Journey to Insights within your Business and Technology Context
     How the BDL reflects them:
      • It concerns both business and (disruptive) technology
      • It works with high volumes of all kinds of data
      • It integrates Unified Data Management capabilities to manage governance, security, privacy, MDM, RDM, etc.
      • It also comes with a new, specific mindset that has to be addressed at the enterprise level
      • We (Capgemini) intend to offer the BDL as-a-Service
      • Bringing business value by delivering Insights at the Point of Action is the motto of the BDL
  8. Business Data Lake Reference Architecture – Conceptual
     Characteristics:
      • Store anything; analyze everything
      • Blend traditional data elements with new data types
      • Manage centrally, govern locally
      • Future-proof design
      • Highly scalable and available
     Layers (numbered 1–11 in the diagram):
      • Sources: structured data sources, un/semi-structured data sources, transactional systems (RES/CRM)
      • Data Ingestion Layer: extract & load, streams
      • Data Lake Layer: landing zone on a distributed storage layer, with a distributed compute layer/services
      • Data Distillation Layer: data quality governance framework (business rules, transformation, aggregation)
      • Data Provisioning Layer: ODS, customer master (CRM), marts (e.g., HR Mart 1, HR Mart 2)
      • Data Access Layer: SQL-on-Hadoop, sandbox, in-memory grid, data virtualization or blending
      • Data Dissemination Layer: data visualization and reporting, advanced analytics, self-service, API layer
      • Cross-cutting: data governance (audit, lineage), metadata management, data security (authentication, authorization, Kerberos), data governance integration
  9. Business Data Lake Reference Architecture – Logical. The same layers mapped to a concrete stack (an indicative streaming-ingest sketch follows below):
      • Platform: Hortonworks HDP 2.5 or latest
      • Data Ingestion Layer: Talend 6.3 or latest for extract & load; Kafka with Spark Streaming/Storm for streams
      • Data Lake Layer: distributed storage with Spark, HBase and Hive for compute
      • Data Provisioning Layer: HBase/Hive data marts, Redshift
      • Data Dissemination Layer: Zeppelin for data visualization and reporting, advanced analytics and self-service; RESTful services for the API layer
      • Governance and security: Atlas for metadata management and governance; Ranger and Knox for data security
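To make the streams path of the logical architecture concrete, here is a minimal PySpark Structured Streaming sketch that reads events from Kafka and lands them in the lake. The broker address, topic name, event schema and HDFS paths are illustrative assumptions, not part of the reference architecture, and the job needs the spark-sql-kafka connector on its classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Hypothetical event schema; real topics and fields depend on the source system.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# Read the raw byte stream from Kafka (broker and topic are illustrative).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Land the parsed events in the lake's landing zone as Parquet; the
# checkpoint directory lets the stream recover after a failure.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///datalake/landing/clickstream")
         .option("checkpointLocation", "hdfs:///datalake/checkpoints/clickstream")
         .outputMode("append")
         .start())
query.awaitTermination()
```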
  10. Detailed layer break-up
  11. Reference architecture for data ingestion (indicative)
      Functionality: ingest data from a variety of sources, at varying latency, into the data lake (a batch-ingest sketch follows this slide):
      • Data sourcing: S/FTP-based push (logs, text, other file-based sources); source extraction services (XML, relational, other extracts); changed-data management (delta extracts, event management)
      • Data state: data at rest (ETL pushdown, batch using standard DI tools or Sqoop) and data in motion (fast data, processed via tools like Flume, Storm or Spark)
      • Data transformation: fast data manipulation (sorting, file merges, joins, file splitting); transform routines (aggregation, mappings, lookups, calculations); big data transformations (user-defined functions / custom MapReduce code in Java, Python, etc. for complex logic); ETL pushdown processing (execute mapping jobs on the Hadoop cluster via HDFS/Hive/Spark)
      • Common services: metadata management, automation services, deployment (jobs and others), error handling, clustering and capacity
      Characteristics:
      • The data ingestion design principles are based on integrating raw data characterized by extreme scale and variability, with provisions for both "data at rest" (batch) and "data in motion" (low latency)
      • The framework combines traditional data integration methodologies built on the Extract-Transform-Load approach and extends them to also process semi-structured and unstructured data elements
      • The classical model of tracking data elements through their lifecycle and providing lineage can be added to this framework
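For the "data at rest" path, the slide names standard DI tools or Sqoop; a functionally similar batch pull can be sketched with Spark's JDBC reader. The connection URL, table, watermark predicate and paths below are illustrative assumptions, and the JDBC driver must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crm-batch-ingest").getOrCreate()

# Pull an incremental delta from a relational source over JDBC. In a real
# ingestion framework the watermark would come from changed-data management
# metadata rather than a hard-coded date.
orders = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://crm-db:5432/crm")
          .option("dbtable",
                  "(SELECT * FROM orders WHERE updated_at > DATE '2016-01-01') t")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Land the extract unchanged in the landing zone; transformation and data
# quality checks happen downstream in the distillation layer.
orders.write.mode("append").parquet("hdfs:///datalake/landing/crm/orders")
```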
  12. Data Acquisition and Reconciliation
      Data Acquisition combines a landing zone with data validation, delta detection and data enrichment:
      • Landing zone: the area where data from all source systems across the client's landscape lands for utilization/consumption by downstream systems
      • Data validation: the first checkpoint, where MDM-based checks are applied to the incoming source data files
      • Delta detection: applies to feeds from source systems that can send incremental delta data for regular, ongoing processing into the data lake
      • Data enrichment: processes used to enhance, refine or otherwise improve raw data; data from enrichment sources is pushed into the data lake via the landing zone to enrich existing data
      Data Reconciliation (optional) is part of data quality and ensures data integrity in the data lake by checking that data has been loaded properly, for accuracy and completeness:
      • Master data: a fairly simple process, since master data changes infrequently and keeps the same granularity in source and target
      • Transactional data: instrumental to the success of big data systems; reconciliation can run on the entire data set or on the incremental data, depending on how the data is ingested
      • Separate metadata tables/files designed specifically for reconciliation are populated with reconciliation queries, and reconciliation reports are generated after data is loaded into the data lake (a reconciliation sketch follows this slide)
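A minimal sketch of the reconciliation step, assuming control totals (a row count and an amount checksum) were captured at extraction time. Here they are hard-coded for illustration, whereas the slide has them stored in dedicated reconciliation metadata tables/files:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("recon-orders").getOrCreate()

# Control totals recorded when the source extract was produced (illustrative
# values; normally read from the reconciliation metadata tables).
source_totals = {"row_count": 1250000, "amount_sum": 98431557.25}

lake = spark.read.parquet("hdfs:///datalake/landing/crm/orders")
lake_totals = lake.agg(
    F.count("*").alias("row_count"),
    F.sum("amount").alias("amount_sum"),
).first()

# Compare counts exactly and monetary sums within a small tolerance, then
# feed the outcome into the reconciliation report.
count_ok = lake_totals["row_count"] == source_totals["row_count"]
sum_ok = abs(lake_totals["amount_sum"] - source_totals["amount_sum"]) < 0.01
print("RECONCILED" if count_ok and sum_ok else "MISMATCH: investigate load")
```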
  13. Data Distillation in the Data Lake: approach to provisioning for data consumption
      Functionality: ingest data from the storage tier and convert it to structured data for easier analysis by downstream applications. This is done by extracting, transforming and aggregating high-quality data from the data lake and making it available to analytical and reporting applications. Transformation also involves data quality checks and corrections, such as profiling, validating and cleansing structured and unstructured data against business rules. Data is distilled (or prepared) on a per-function basis and made available for consumption, consistent with the design practice of "manage data centrally and provision locally" (a distillation sketch follows this slide).
      Characteristics:
      • A uniform approach for distilling information from the data lake
      • A centralized data quality engine that applies uniform data quality rules across the enterprise
      • An integrated data quality function to cleanse, standardize, enrich and de-duplicate data
      • A console for design, development and validation of rules
      • Data quality services for integration with operational systems and MDM
      • An exception management solution for resolving data issues and errors
      • Data quality processes running on the data are translated into MapReduce for faster processing
      Diagram components: a data quality engine (data profiling, data cleansing, match & merge, data enrichment, rule manager, DQ metadata) and a data quality console (data dashboard, exception management, data quality configurator, exception repository, DQ mart), sitting between the data persistence layer and the extract/transform/aggregate/secure services, backed by a data quality store.
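The slide describes rules being pushed down to the cluster as MapReduce; the same cleanse / validate / match-and-merge flow can be sketched in PySpark instead. The column names, the email rule and the paths are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dq-distill").getOrCreate()

raw = spark.read.parquet("hdfs:///datalake/landing/crm/customers")

# Standardize and cleanse (in a real setup these rules would come from the
# central rule manager, not be hard-coded in the job).
clean = (raw
         .withColumn("email", F.lower(F.trim(F.col("email"))))
         .withColumn("country", F.upper(F.col("country"))))

# Validate: route rule violations to the exception repository rather than
# silently dropping them.
valid_email = F.col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
exceptions = clean.filter(~valid_email)
passed = clean.filter(valid_email)

# Match & merge, simplified to de-duplication on the business key, keeping
# the most recent record per customer.
latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
curated = (passed
           .withColumn("rn", F.row_number().over(latest))
           .filter("rn = 1")
           .drop("rn"))

exceptions.write.mode("append").parquet("hdfs:///datalake/exceptions/customers")
curated.write.mode("overwrite").parquet("hdfs:///datalake/curated/customers")
```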
  14. Data Persistence Layer: Schema on Read & Distill on Demand
      Functionality: create a single repository for information and deliver a single, silo-less store to handle all types of data for all reporting, analysis and discovery requirements. The diagram shows an HDFS storage cluster/rack (NameNode, DataNodes, replication, job/task tracker) behind multi-tiered zones fed by data ingestion: landing, staging and curated stores, supported by audit, metadata and search services.
      Characteristics:
      • Deliver a single, comprehensive view of all data across functional areas, to support deep analysis
      • A multi-tiered data lake that serves distinct functions, e.g. landing, staging and curated stores
      • A landing area containing both traditional and non-traditional data, characterized by the attributes of value, veracity, volume, velocity and variety
      • Eliminate the need for upfront schema design and rigid pre-configured models
      • Easy and cost-effective configuration for scale-up and scale-down
      • Store everything, distill on demand (see the schema-on-read sketch below)
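A small sketch of schema on read: raw files sit in the landing zone with no upfront model, and structure is applied only when a consumer queries them. Paths and field names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No upfront schema design: Spark infers structure from the raw JSON at
# read time (an explicit schema could equally be applied here per use case).
clicks = spark.read.json("hdfs:///datalake/landing/clickstream/")

# The same raw data can be projected differently by each consumer; this
# one distills a per-user daily activity count on demand.
daily_activity = (clicks
                  .selectExpr("user_id", "cast(ts as date) AS day")
                  .groupBy("user_id", "day")
                  .count())
daily_activity.show()
```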
  15. Approach to Data Provisioning
      Functionality: provision data sets to create various combinations of custom views, for specific functions/departments as well as for cross-functional access. The data access layer comprises a discovery platform/sandboxes, analytical views and data virtualization, disseminating to subject marts (e.g., HR Marts 1–4).
      Characteristics:
      • The data marts & aggregate structures layer includes subject-specific data mart structures that various tools can use to retrieve data and information. It also supports user-specific sandboxes where power users can perform activities such as data mining, identifying data patterns, and running analytical and statistical models with various tools
      • If required, there will be multiple versions of the subject areas for different production streams
      • Data marts and aggregate structures such as summary tables are created based on business and performance requirements. As far as possible, database-managed aggregates such as computed views and indexes are created to reduce ETL-based data movement (a provisioning sketch follows below)
      • Data virtualization addresses combining data sets from multiple data stores across the various layers of the data lake stack
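As an illustration of a database-managed aggregate in the provisioning layer, a subject-specific mart table can be built with Spark SQL over the curated zone. The hr_mart database, table and column names are hypothetical, and hr_mart must already exist in the metastore:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("provision-hr-mart")
         .enableHiveSupport()
         .getOrCreate())

# Expose the curated data to SQL, then materialize a monthly aggregate that
# downstream tools query directly instead of moving data through extra ETL hops.
spark.read.parquet("hdfs:///datalake/curated/hr/timesheets") \
     .createOrReplaceTempView("timesheets")

spark.sql("""
    CREATE TABLE IF NOT EXISTS hr_mart.monthly_utilization
    STORED AS PARQUET AS
    SELECT employee_id,
           trunc(work_date, 'MM') AS month,
           sum(hours_billed) / sum(hours_available) AS utilization
    FROM timesheets
    GROUP BY employee_id, trunc(work_date, 'MM')
""")
```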
  16. 16. 16Copyright © Capgemini 2016. All Rights Reserved © David Feinleib 16
