
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud


3 Things to Learn About:

• On-premises versus the cloud: what’s the same and what’s different?
• Benefits of data processing in the cloud
• Best practices and architectural considerations



  1. © Cloudera, Inc. All rights reserved. Data Engineering: Elastic, Low-Cost Data Processing in the Cloud. David Tishgart | Product Marketing | Cloudera. Kaushik Deka | CTO | Novantas
  2. Three Core Enterprise Workload Patterns (Multi-Storage, Multi-Environment)
     • Data Engineering & Data Science: process data, develop and serve predictive models
     • Analytic Database: ELT, reporting, exploratory business intelligence
     • Operational Database: build data-driven applications to deliver real-time insights
  3. Data Engineering in the Cloud. Across industries, data engineering and data science are a natural fit for the cloud:
     ● Data growth: more data is being created in the cloud
     ● Transient workloads: development/test, exploration, batch ETL, model training and scoring
     ● Flexibility: optimize infrastructure for the job; self-service for data engineers and data scientists
     ● Lower TCO: do more with less
  4. Cloudera’s Data Engineering Solution
     • Spark: modern real-time analytics engine
     • Hive-on-Spark: large-scale ETL and batch processing engine
     • Search: interactive search and immediate exploration
     • Navigator: audit, lineage, encryption, key management, and policy lifecycles
     • Cloud deployment: easy deployment and flexible scaling
     • Partner integrations: familiar tools for data science and data integration
  5. Traditional Data Engineering for ETL
     • Ingest data sources (any source and format): unstructured data, structured data, social data, machine data, IoT
     • Transform and combine data with SLAs: stream or batch pipelines at large scale; choice of engine (Spark, Hive on Spark, MapReduce); resource management and fast processing to meet SLAs
     • Processed data consumed by: analytic engines, real-time applications, external storage systems
  6. Data Engineering for Machine Learning Workloads
     • Ongoing data ingestion: raw data from many sources, in many formats, with varying validity
     • Data engineering: cleaning, merging, and filtering yield well-formatted data and training, validation, and test sets
     • Data science: model building, model training, hyperparameter tuning
     • Production: pipeline execution and operation; validated ML models served to end users, with results consumed for analysis
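The data-engineering stages named on this slide can be sketched in a few lines. This is a minimal, illustrative example: the record shape, the validity rule (drop records with missing values), and the 60/20/20 split ratios are all assumptions, not part of the original deck.

```python
def clean(records):
    """Drop records with any missing value (a stand-in for validity checks)."""
    return [r for r in records if all(v is not None for v in r.values())]

def split(records, train=0.6, validation=0.2):
    """Deterministic train/validation/test split by position."""
    n = len(records)
    n_train = int(n * train)
    n_val = int(n * validation)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

# Hypothetical raw input: ten valid records plus one invalid one.
raw = [{"id": i, "balance": 100.0 + i} for i in range(10)]
raw.append({"id": 10, "balance": None})

cleaned = clean(raw)
train_set, val_set, test_set = split(cleaned)
print(len(cleaned), len(train_set), len(val_set), len(test_set))  # 10 6 2 2
```

In a real pipeline these steps would run on Spark over many sources; the shape of the flow (validate, then split before modeling) is the same.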
  7. Requirements for Data Engineering: portability, flexibility, and an end-to-end enterprise platform
     • Transience for flexibility, lower TCO, and lower risk
     • Separation of storage and compute, with object store support
     • Hybrid support for multiple environments
     • A unified platform, from ingest to insight and action
  8. Benefits of Data Engineering with Cloudera: lower TCO and increased flexibility on a trusted enterprise data platform
     Lower cost:
     • Multi-cloud: shop across providers (Amazon, Google, Microsoft)
     • Manage costs: transience for dev/test, ETL, and data science; usage-based pricing; spot instance support
     Increased convenience:
     • Deliver on demand: immediate access to large compute with fast cluster provisioning; self-service for developers
     • Optimize and isolate: tailor infrastructure for the job; run different software versions; enable more experimentation with less opportunity cost
     On a common platform:
     • Build complete data apps: ingest, stream, process, explore, analyze, model, and serve on the same platform
     • Shared data with object store integration; cluster metadata persistence
     • Common compliance-ready security and governance frameworks
  9. Data Engineering in the Cloud: Three Architectural Patterns to Optimize Price, Performance, and Convenience
     • Transient batch (most flexible): spin up clusters as needed, reading from and writing to object storage. On-demand/spot instances; usage-based pricing; clusters sized for the workload; one cluster per tenant/user.
     • Persistent batch (most control): persistent cluster(s) for frequent ETL against object storage. Reserved instances; node-based pricing; grow/shrink; one cluster per tenant group.
     • Persistent batch on HDFS (fastest): top performance for frequent ETL on a shared persistent cluster with HDFS. Reserved instances; node-based pricing; grow/shrink; shared across tenant groups.
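The price trade-off between the transient and persistent patterns comes down to utilization. A back-of-the-envelope sketch, using entirely hypothetical prices (real cloud pricing varies by provider, region, and instance type):

```python
# Hypothetical rates -- illustrative only, not actual provider pricing.
ON_DEMAND_RATE = 0.40   # $/node-hour, usage-based (transient batch)
RESERVED_RATE = 0.25    # effective $/node-hour, reserved (persistent batch)
NODES = 20
HOURS_PER_DAY = 24

def transient_cost(busy_hours_per_day):
    """Daily cost when the cluster only exists while jobs run."""
    return NODES * ON_DEMAND_RATE * busy_hours_per_day

def persistent_cost():
    """Daily cost of a cluster that runs around the clock."""
    return NODES * RESERVED_RATE * HOURS_PER_DAY

for busy in (2, 8, 16):
    print(f"{busy:>2}h/day busy: transient ${transient_cost(busy):.2f}, "
          f"persistent ${persistent_cost():.2f}")
```

Under these assumed rates, transient clusters win handily for bursty workloads (a cluster busy 2 hours a day costs a fraction of an always-on one), while persistent reserved clusters become cheaper once the cluster is busy most of the day, which matches the slide's pairing of transient batch with usage-based pricing and persistent batch with reserved instances.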
  10. Data Engineering for Customer Journey Analytics and Scoring in Financial Services. Kaushik Deka, CTO, Novantas
  11. Novantas is the leader in customer science and revenue strategies for the financial industry, through analytics that leverage data, advice, and technology.
     • A 2016 FinTech 100 company based in Manhattan (NYC), providing pricing, distribution, treasury/risk, and marketing solutions in retail banking
     • Expert practice leaders work daily with CEOs and functional heads around the globe
     • Decision support and analytic platforms help top-20 US banks manage over US$1.5 trillion of deposits
     • A center of excellence in big data analytics in consumer banking
  12. Novantas Supports Banks in Understanding All Stages of the Customer Journey. Novantas has developed leading-edge analytics and modeling capabilities to help banks improve their performance at all stages of the customer journey: acquisition, activation and engagement, maturity, and senescence. Capabilities span customer segmentation, customer targeting, channel optimization, offer optimization, activation propensity, customer potential value, deposit modelling, promotional optimization, usage optimization, attrition propensity, retention campaign targeting, customer lifetime value, cross-sell and upsell modelling, revenue optimization, and primacy/exclusivity optimization.
  13. Core Elements of Customer Potential Value Calculation. One of our unique contributions to customer journey analytics is customer scoring, particularly metrics to determine the potential value of a bank’s customers. Sample complexities:
     • Current profitability: calculation of the current value contribution of a customer. Differentiating and appropriately valuing deposits (core, promotional); scoping the calculation (e.g., current account only, bank only, full relationship) and, if less than the full relationship, accounting for additional profits/costs generated elsewhere.
     • Over-time account potential (CLV focus): assessment of the customer’s value contribution longitudinally (over time). Estimating future account usage patterns (balances, transactional behavior, savings/borrowing requirements, etc.); choosing the duration of the calculation (lifetime, 10-year, 5-year, etc.).
     • Across-wallet account potential (CPV focus): assessment of the customer’s value contribution latitudinally (across wallet). Estimating the current off-us wallet (balances and value to the bank); choosing the scope (e.g., checking only, bank only, full relationship).
  14. Illustrative Scoring Calculation Complexities. The data engineering challenges that underpin the development of these scores are immense.
     • There are literally thousands of stratified variables that could potentially go into the calculations: basic variables (average daily balance, number of deposits per cycle, average non-bill transfer-out value, etc.), first-order derivatives (rate of change in average daily balances, rate of change in monthly branch transactions, etc.), second-order derivatives (rate of change in the first-order derivative variables, e.g., the rate of change in the rate of change in average daily balances), and so on. Being able to rapidly perform feature engineering at scale on large data sets is essential given the large number of variables that must be evaluated.
     • Choosing these variables is also affected by the curated data available: understanding sources of useful data, knowing how to parse difficult data into usable information, knowing which data can easily be substituted for more readily available sources, and experience in mapping multiple data sources to a semantically integrated financial data model. Being able to curate and map a wide range of data sources and types to a standard data model ensures data integrity and lets data scientists spend more time on modeling rather than data wrangling.
     • Even when data is available, interpreting it is complicated and metric/model definitions can change. Transactional clues (frequency of payment, consistency of amount, relation of payment amount to account balances) and account clues (existence of a mortgage or credit card at the bank, presence of utility payments or income payments in the account) must be weighed: is a $1,467.32 statement payment to Bank X a mortgage? Savings? A transfer to a secondary (primary?) account? A credit card payment? Being able to govern business metadata and track model performance is essential to the ongoing application of the metrics/scores.
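The first- and second-order derivative variables described above have a simple structure: a first-order derivative is the period-over-period change in a metric, and a second-order derivative is the change in that change. A minimal sketch (the balance figures are made up for illustration):

```python
def first_order(series):
    """Period-over-period change in a metric series (first-order derivative)."""
    return [b - a for a, b in zip(series, series[1:])]

def second_order(series):
    """Rate of change of the first-order derivative (second-order derivative)."""
    return first_order(first_order(series))

# Hypothetical monthly average daily balances for one customer.
avg_daily_balance = [1000.0, 1100.0, 1350.0, 1300.0]

d1 = first_order(avg_daily_balance)   # [100.0, 250.0, -50.0]
d2 = second_order(avg_daily_balance)  # [150.0, -300.0]
print(d1, d2)
```

Each derivative order applied to each basic variable multiplies the feature count, which is why the slide stresses feature engineering at scale: in the production system these transforms would run in parallel on Spark across millions of customers rather than in plain Python.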
  15. Our Customer Analytics Platform, built on CDH 5.8, leverages Spark on YARN and is engineered for high-performance analytics on both AWS and private cloud
     • Internal and external data sources: internal bank data, third-party data (e.g., competitor pricing), public-domain data, and Novantas proprietary data
     • Customer Data Hub (hybrid cloud on HDFS): Novantas banking data model and analytics database
     • Analytics operating system (Spark on YARN, on CDH): metadata governance with metadata published to Navigator, metrics library management, and APIs for operationalizing predictive models
     • Ecosystem of fit-for-purpose, domain-specific end-user apps: MetricScape scoring workbench, BI/reporting, scoring/campaigns, scenario/optimization, forecasting, rate/offer delivery, analytic dataset generation
  16. A banking ontology stored in a containerized format on HDFS enables efficient data processing
  17. Our MetricScape scoring workbench has built-in metadata governance and code-generation capability and leverages Spark, Navigator, Search, Hue, and Impala
     • Manage the metrics/scores library: create/modify metric definitions using the Novantas Spark API for banking; capture metadata and publish it to Navigator; faceted search and tagging; version control and business traceability
     • Manage data sources: register a use-case-specific data model, stored in a container-based storage format in Hadoop for optimal processing at the customer level
     • Generate analytic datasets: create datasets to support specific scoring use cases (segmentation, multi-point, event-aligned, etc.)
     • Connect to a BI tool (Tableau) via the Impala connector for data visualization; to Jupyter/RStudio to train and test models; and to Hue/Impala for interactive queries
  18. Data Life Cycle (Hybrid Cloud)
     • Data sources (HDFS or S3): each source carries raw files, raw maps, processes/derivations, and settings/preferences
     • A data-driven extraction pipeline harmonizes sources into a logical data warehouse (HDFS/Parquet) conforming to the standard banking data model: stable entity keys, common entities, common dimensions, and derived datasets from downstream processes, with ontology validation
     • The metrics and model factory (Spark), accessed through the MetricScape API, produces per-domain analytic datasets, a use-case-driven data model, and model API endpoints
     • A metadata catalog (Navigator, Search) and BI/Impala access sit across the life cycle
  19. Technology catalysts of a data engineering solution for customer journey analytics and scoring use cases
     • An efficient, cost-effective storage model on Hadoop that works on hybrid cloud and co-partitions and co-locates related data conforming to a banking ontology
     • A high-performance, domain-specific Spark API onto the semantic data model, leveraging the Spark ecosystem to parallelize metrics and models
     • A data science workbench with built-in metadata governance and code-generation capability
     • A curated library of parameterized metrics with data lineage that can be leveraged to score millions of customers
     • A metadata governance and version control framework built into feature engineering and all analyses on the workbench, cataloged in Cloudera Navigator
  20. Business case for a large US bank: optimize the role and value of promotional pricing to drive rate-insensitive deposit growth using customer propensity modeling and scoring
     Challenge:
     • What are the material segments of depositors that react to promotional pricing, and when and why?
     • At what point in the customer journey can the bank most economically influence deposit consolidation?
     Massive dataset with transactional information:
     • 9 years of monthly customer and account holdings
     • 4 years of money-in/money-out detail
     • 2 years of offer disposition history
     • Third-party data and Novantas wallet models
     Deep analytics: repeatable, reliable scoring models:
     • Descriptive metrics for customer journey exploration
     • Scoring models identifying price sensitivity, shopping behavior, deposit cost given churn, persistence, and CPV
     • Over 1,000 metrics/scores per customer, generated in 14 ms
     IT benefits (cloud solution):
     • Low TCO (hybrid cloud); scalable infrastructure (e.g., scale storage and compute separately); cost management (e.g., transient nodes for variable workloads); fast cluster provisioning (e.g., Cloudera Director); ease of administration and maintenance (e.g., Cloudera Manager)
     Business impact:
     • Reduce promotional spend by 50% through precision targeting of marketing treatments
     • Increase initiatives around achieving primacy
     • Limit retention offers: reduce promotion expense by 10% and balance retention by only 3%, with nominal change in customer retention
  21. Cloudera Director Provisioning: Cluster Lifecycle Management. Spin up, grow and shrink, and terminate CDH clusters that read from and write to object storage.
     • Easy administration: dynamic cluster lifecycle management; a single pane of glass with a multi-cluster view
     • Flexible deployments: multi-cloud (AWS, Azure, GCP); fast cluster deployments; scaling of CDH clusters; spot instance support
     • Enterprise-grade: integration across Cloudera Enterprise; management of CDH deployments at scale
  22. Instance Recommendations: Default Guidelines (based on Apache Spark best practices)
     Data nodes:
     Workload                                                | AWS                          | Azure               | Google
     Default (mixed workloads)                               | m4.2xlarge (or greater)      | D3-5 v2             | n1-standard-4/8/16
     Compute-intensive (e.g., machine learning simulations)  | c4.2xlarge (or greater)      | F4, F8, F16         | n1-highcpu-8/16/32
     Memory-intensive (e.g., large, cached Spark objects)    | r4.2xlarge (or greater)      | D11-15 v2           | n1-highmem-4/8/16/32
     I/O-intensive (e.g., multiple or shared R/W steps)      | EBS-backed (see “exceptions”)| Use Premium Storage |
     Master nodes:
     Type | AWS        | Azure   | Google
     CM   | m4.4xlarge | DS13 v2 | n1-standard-16
     CDH  | c4.4xlarge | DS14 v2 | n1-highmem-16
     Master node notes:
     ● Size memory in line with the cluster size.
     ● Do not use Spot. Spot Block is acceptable if the reservation duration exceeds the workload time.
     ● Start with 50 GB of block storage (gp2 for AWS) for CM.
  23. Q&A
  24. Next steps
     • Check out our other “best practices in the cloud” webinars
     • Learn more about Novantas and companies like them
     • Visit our downloads page and take Cloudera Director for a spin
  25. Thank you