© Cloudera, Inc. All rights reserved.
The Future of Data Warehousing: ETL Will Never be the Same
Ralph Kimball | Founder, Kimball Group
Manish Vipani | Vice President and Chief Architect, Kaiser Permanente
Hadoop’s impact on data warehousing
• Traditional DBMS stack exploded into separate layers
• Data layer: HDFS files, not curated relational tables
• Metadata layer: open extensible HCatalog, not vendor system tables
• Query layer: cottage industry of query engines, not vendor specific SQL
• Schema on Read
• Allow the query layer to decide how to consume the data
• Materialize the view later (e.g., into Parquet files) for high performance
• Integration goes far beyond relational tables
• Conformed dimensions remain the glue holding together Hadoop applications (even if you have never heard of conformed dimensions!)
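To make the "glue" role of conformed dimensions concrete, here is a minimal Python sketch, not taken from the talk. It assumes two independent sources that both key their facts by the same member identifiers, so each can be summarized by a shared attribute and the results lined up side by side. All names and values (member_dim, web_logins, clinic_visits, the regions) are hypothetical.

```python
from collections import defaultdict

# Hypothetical conformed dimension: one shared set of member attributes.
# Both sources below key their facts by the same member_id values.
member_dim = {
    "m1": {"region": "North"},
    "m2": {"region": "South"},
    "m3": {"region": "North"},
}

# Hypothetical facts from two independent sources.
web_logins = [("m1", 5), ("m2", 2), ("m3", 1)]      # (member_id, logins)
clinic_visits = [("m1", 1), ("m2", 4), ("m3", 2)]   # (member_id, visits)


def total_by_region(facts, dimension):
    """Sum a fact measure by the conformed 'region' attribute."""
    totals = defaultdict(int)
    for member_id, value in facts:
        totals[dimension[member_id]["region"]] += value
    return dict(totals)


def drill_across(*summaries):
    """Line up per-source summaries on their shared row labels."""
    labels = sorted(set().union(*summaries))
    return {label: tuple(s.get(label, 0) for s in summaries) for label in labels}


logins = total_by_region(web_logins, member_dim)
visits = total_by_region(clinic_visits, member_dim)
report = drill_across(logins, visits)
print(report)
```

The two summaries can only be combined because both use the same region labels. If each source carried its own, differently coded region attribute, the final step would have nothing reliable to join on.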
The logical architecture hasn’t changed
• Original Sources -> ETL Step -> Exposed Presentation Data -> BI Application
• BUT, the physical architecture of the back room now looks very different
The old and new back rooms
Old back room
• Slow transfer from sources
• Physical transformations required
• Cleaning, normalization required
• Mandated RDBMS table targets
• Metadata limited to system tables
• Presentation layer vendor mandated
• Single focus: RDBMS SQL only
New back room
• Purpose built for high transfer rates
• Physical transformations optional
• Cleaning, normalization discouraged
• Table targets optional or deferred
• Extensible metadata via HCatalog
• Presentation layer open ended
• Before or after any transformations
• Analytic client specific
• Multiple simultaneous personalities
The biggest change to the back room
Old back room
• Off limits except to ETL staff
• “we aren’t ready”
• “the data must be cleaned”
• “data governance trumps”
• “end users not trusted”
• Traditional IT control
New back room
• Doors open to
• Qualified analytic users
• Automated processes
• Experiments, model building
• Clients other than SQL
• Open data marketplace
The Landing Zone
at Kaiser Permanente
Implementing the new ETL approach in the real world.
A unified data repository for secure and trusted data.
Landing Zone
Landing Zone – Home to secure and organized data
• A self-service data platform hosting both raw and prepared data sets for quick business consumption, to drive advanced business insights and decisions.
• Allows seamless data access for authorized users across enterprise business functions.
• Data is organized by domains/use cases in the Raw and Refined zones.
• Perimeter security, with data encrypted at rest.
• Kerberized, with integration to the Identity and Access Management system.
Parts of Landing Zone
• Raw Zone -> Exact replica of source data.
• Refined Zone -> Transformed, prepared data sets organized by use cases.
• User Defined Space -> Secure and common access to raw and trusted data.
• Master Data, Metadata, Internal Reference Data, Industry Reference Data, etc.
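One way to picture the zone organization above is as a directory convention. The following Python sketch is illustrative only: the root path, the zone directory names, and the example domains and data sets are all assumptions, since the slides do not show actual paths.

```python
import posixpath
from enum import Enum


class Zone(Enum):
    RAW = "raw"          # exact replica of source data
    REFINED = "refined"  # transformed, prepared data sets
    USER = "user"        # user defined space


# Hypothetical root directory; the slides do not give real paths.
ROOT = "/landing_zone"


def dataset_path(zone: Zone, group: str, dataset: str) -> str:
    """Build a path of the form <root>/<zone>/<group>/<dataset>.

    'group' is an information domain for RAW, a use case for REFINED,
    and a user or team name for USER (an assumption of this sketch).
    """
    for part in (group, dataset):
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(ROOT, zone.value, group, dataset)


raw = dataset_path(Zone.RAW, "claims", "members")
refined = dataset_path(Zone.REFINED, "online_services", "member_logins_daily")
print(raw)
print(refined)
```

Keeping the zone as the first path component means that a permission or retention rule can be attached to a whole zone at once, which is one plausible reason to organize data this way.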
Landing Zone
Landing Zone – A self-service data platform hosting both raw and prepared data sets for quick business consumption.
Data Security
• Deployed on a secured network with traffic monitoring.
• Data is encrypted at rest.
• Role-based access and authorization.
Data Organization
• Exact replica of source data, organized by information domains, in the Raw Zone.
• Data organized by use cases in the Refined Zone (transformed, prepared data sets).
• Separate area allocated to track master data, metadata, internal reference data, and industry-specific reference data sets.
[Diagram: Landing Zone architecture on HDFS. Labels: Source Data; Data Load; Extract-Load; Copy; Replicate; Raw Zone; Refined Zone; User Defined Space; Master Data; Meta Data; Internal Reference Data; Industry Reference Data; Usage Data; Data Registry (Tags & Catalog); Data Selection; SQL, Java, Pig, Hive, Impala, Python; Exploratory Intelligence (Analyze, Mine, Refine, Discover); Data Extract; DW/DM; Access; Authentication; Perimeter Security; Role Based Access Control; All Data Encrypted at Rest.]
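The role-based access idea on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions: the role names and the grants are invented for illustration, and in a real cluster authorization would be enforced by the platform rather than by application code like this.

```python
# Hypothetical roles and grants; the slides name no actual roles.
GRANTS = {
    "etl_service": {("raw", "read"), ("raw", "write"),
                    ("refined", "read"), ("refined", "write")},
    "analyst": {("refined", "read"), ("user", "read"), ("user", "write")},
    "data_scientist": {("raw", "read"), ("refined", "read"),
                       ("user", "read"), ("user", "write")},
}


def is_allowed(roles, zone, action):
    """Return True if any of the caller's roles grants (zone, action)."""
    return any((zone, action) in GRANTS.get(role, set()) for role in roles)


print(is_allowed(["analyst"], "refined", "read"))  # analysts may read refined data
print(is_allowed(["analyst"], "raw", "read"))      # but not the raw replica
```

Unknown roles fall through to an empty grant set, so the check denies by default.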
The ETL Revolution Poses
Significant Challenges
Some old, some new
Old challenges we’ve seen before
• Big data world is furiously implementing stovepipes
• Good news is the excitement of new data sources and analyses
• Bad news is ignoring integration, the fix is to start over
• New departments not seen with traditional data warehousing
• Not on anyone’s radar -> rolling their own systems
• Unusual business user profiles, latency demands, security lapses
• Big speed bumps when replacing old systems with new
• Users don’t want to switch
• New results don’t match old results
• Legacy hardware and software absurdly expensive, doesn’t scale reasonably
New challenges needing inventive approaches
• Traditional BI decision makers joined by
• Data scientists
• Roll their own ETL, hardware, OSs, programming languages
• Take results to senior management directly
• Don’t stick around for documentation, rollout, user support, maintenance
• Predictive models and modelers
• Constantly changing schemas
• Tricky integration, e.g., joining relational tables to HBase
• Automatic daemons
• Enormous, bursty demand for computing resources
Kaiser Permanente’s Pragmatic
Response to the Challenges
Pain Points:
• Lack of a user transient store and structural flexibility, due to slow adaptation to changes.
• Lack of ability to do analytics and hypothesis testing of new data from disparate systems.
Successes:
• More than 10 proven use cases with some early adopters.
Landing Zone use cases
Online Member Services – “kp.org”
Problem
• Lack of insight into the factors influencing members’ adoption and utilization of online services.
• Lack of data integration and correlation due to disparate systems.
• Lack of a 360° view of member service utilization, and of dashboards.
Resolution
• Summarized and aggregated data sets in the Landing Zone help improve decision making.
• Faster and complete access to data at scale for metrics reporting and analytics.
• Reduced data collection and metric reporting time from 3 weeks to 10 hours.
• Ease of building “decision-centric” dashboards (8 in 3 months).
Landing Zone use cases (cont.)
Moving Traditional Data Warehouse Workload to Landing Zone
Problem
• Commercial large-scale data warehouse (Teradata) repository is expensive at scale, grows exponentially, and processes large volumes of queries per month.
• Continuing workload tuning efforts are slow to yield expected results.
Resolution
• Replicate data from Teradata into the Landing Zone.
• Rewrite and tune queries, eliminating semantically equivalent queries, to achieve better performance.
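As a rough illustration of hunting for redundant queries in a workload, the Python sketch below groups queries whose text differs only in comments, whitespace, letter case, or a trailing semicolon. This is an assumption-laden first pass, not the method used at Kaiser Permanente: true semantic equivalence would require parsing the SQL or comparing query plans, and the query names and tables here are hypothetical.

```python
import re
from collections import defaultdict


def normalize(sql: str) -> str:
    """Crude canonical form: drop comments, collapse whitespace, lowercase.

    Lowercasing also alters string literals, so this is only suitable
    for flagging candidates, not for proving equivalence.
    """
    sql = re.sub(r"--[^\n]*", " ", sql)               # line comments
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.S)  # block comments
    sql = re.sub(r"\s+", " ", sql).strip().rstrip(";").strip()
    return sql.lower()


def group_duplicates(queries):
    """Return sorted name groups for queries sharing a normalized text."""
    groups = defaultdict(list)
    for name, sql in queries.items():
        groups[normalize(sql)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical workload sample.
workload = {
    "report_a": "SELECT member_id, COUNT(*) FROM visits GROUP BY member_id;",
    "report_b": "select member_id,\n  count(*)\nfrom visits -- nightly\ngroup by member_id",
    "report_c": "SELECT region FROM members",
}
duplicates = group_duplicates(workload)
print(duplicates)
```

Queries flagged this way are candidates for consolidation. Each group still needs review before one query is retired in favor of another.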
Digital Services Dashboard – “Interchange”
Problem
• Lack of a platform to collect and correlate structured and unstructured data from consumer-facing health monitoring devices (e.g., Fitbit, glucometer).
• Clinicians cannot track members’ health or weight goals or see usage patterns.
Resolution
• Ingest transactional data and device logs into the Landing Zone and create an analytics workspace.
• Enable clinicians to generate aggregated data for tracking member adherence and to build dashboards using native tools.
Landing Zone use cases (cont.)
Common Clinical and Analytical Views
Problem
• Sequential and fragmented processes with limited ability to enrich data sources to increase accuracy.
• Lack of clinical and analytical views increases lead time to analysis and produces inconsistent results.
Resolution
• Ingest data from fragmented systems into the Landing Zone.
• Created program-wide clinical and analytical views, cutting refresh time from 18 hours to 7 hours.
Consumer Information Management Platform – CIMP 2.0
Problem
• Current Medicare reporting solution does not maintain history and requires significant effort to recreate prior reports and perform trend analysis.
• Externally hosted CIMP systems are cost-prohibitive and difficult to scale.
Resolution
• Replicate data from 30+ source systems into the Landing Zone, providing access to data internally.
• Rebuild reports with improved performance that run within reasonable time at scale.
• Proved versatility of the platform to handle data at scale and created equivalent reports.
Architectural Wrap-Up
What does all this mean?
Kaiser Permanente is a work in progress
with impressive early results,
and insights for moving forward
• Be the single source of all Kaiser’s data, as well as external data leveraged by Kaiser applications and processes and for Kaiser decision making.
• “Learn and adapt” model provides common capabilities across a rich data set, with increased agility in provisioning new data sets.
• Enabling data profiling/tagging, semantic search, and descriptive, predictive, and prescriptive analytics to drive advanced business insights and decisions.
The Back Room Landing Zone has become
a Vibrant Marketplace
• Replaces the quiet ETL back room
• Challenging (exciting) new service role for IT
• Open for business
• Data scientists -> A/B testing -> experimentation -> prototyping
• Simultaneous ETL pipelines -> aggregates, high-performance Parquet files, uploads to EDW
• Simultaneous SQL and non-SQL clients
• Immediate access
• Don’t wait for physical transformation -> schema-on-read
• Purpose built for extreme I/O performance
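The schema-on-read idea can be sketched in plain Python: raw lines are stored untouched, and each reader applies its own schema only when it queries. The sample records, field names, and schema below are hypothetical. In a Hadoop setting the materialized result might be written to Parquet files, which would need an additional library, so an in-memory list stands in for that step here.

```python
import json
from datetime import date

# Raw lines land untouched (hypothetical sample data, not from the slides).
RAW_LINES = [
    '{"member": "m1", "day": "2015-03-01", "steps": "8200"}',
    '{"member": "m2", "day": "2015-03-01", "steps": "n/a"}',
    '{"member": "m1", "day": "2015-03-02", "steps": "10400", "device": "band"}',
]

# A schema is just field name -> parser, chosen by the reader at query time.
STEPS_SCHEMA = {"member": str, "day": date.fromisoformat, "steps": int}


def read_with_schema(lines, schema):
    """Yield typed records; skip rows this schema cannot parse."""
    for line in lines:
        raw = json.loads(line)
        try:
            yield {field: parse(raw[field]) for field, parse in schema.items()}
        except (KeyError, ValueError):
            continue  # the raw line stays available to other readers


# "Materialize the view later": keep the parsed result for repeated queries.
materialized = list(read_with_schema(RAW_LINES, STEPS_SCHEMA))
total_steps = sum(record["steps"] for record in materialized)
print(len(materialized), total_steps)
```

Because nothing is rejected at load time, a second reader with a different schema (for example, one that only wants the device field) can work from the same raw lines without any reload.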
Thank you
Ralph Kimball, ralphcollector@gmail.com
Manish Vipani, manish.x.vipani@kp.org

More Related Content

What's hot

Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 

What's hot (20)

How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Data Vault and DW2.0
Data Vault and DW2.0Data Vault and DW2.0
Data Vault and DW2.0
 
Data Architecture - The Foundation for Enterprise Architecture and Governance
Data Architecture - The Foundation for Enterprise Architecture and GovernanceData Architecture - The Foundation for Enterprise Architecture and Governance
Data Architecture - The Foundation for Enterprise Architecture and Governance
 
Introduction to Data Vault Modeling
Introduction to Data Vault ModelingIntroduction to Data Vault Modeling
Introduction to Data Vault Modeling
 
RWDG Webinar: Data Steward Definition and Other Data Governance Roles
RWDG Webinar: Data Steward Definition and Other Data Governance RolesRWDG Webinar: Data Steward Definition and Other Data Governance Roles
RWDG Webinar: Data Steward Definition and Other Data Governance Roles
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and Forwards
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
 
data-mesh-101.pptx
data-mesh-101.pptxdata-mesh-101.pptx
data-mesh-101.pptx
 
The Path to Data and Analytics Modernization
The Path to Data and Analytics ModernizationThe Path to Data and Analytics Modernization
The Path to Data and Analytics Modernization
 
Successful Data Governance Models and Frameworks
Successful Data Governance Models and FrameworksSuccessful Data Governance Models and Frameworks
Successful Data Governance Models and Frameworks
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
The role of Big Data and Modern Data Management in Driving a Customer 360 fro...
The role of Big Data and Modern Data Management in Driving a Customer 360 fro...The role of Big Data and Modern Data Management in Driving a Customer 360 fro...
The role of Big Data and Modern Data Management in Driving a Customer 360 fro...
 
Rethinking Trust in Data
Rethinking Trust in Data Rethinking Trust in Data
Rethinking Trust in Data
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Ibm data governance framework
Ibm data governance frameworkIbm data governance framework
Ibm data governance framework
 

Viewers also liked

Viewers also liked (20)

Creating a Modern Data Architecture
Creating a Modern Data ArchitectureCreating a Modern Data Architecture
Creating a Modern Data Architecture
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Hpc 4 5
Hpc 4 5Hpc 4 5
Hpc 4 5
 
Achieving Business Value by Fusing Hadoop and Corporate Data
Achieving Business Value by Fusing Hadoop and Corporate DataAchieving Business Value by Fusing Hadoop and Corporate Data
Achieving Business Value by Fusing Hadoop and Corporate Data
 
H0114857
H0114857H0114857
H0114857
 
To Study E T L ( Extract, Transform, Load) Tools Specially S Q L Server I...
To Study  E T L ( Extract, Transform, Load) Tools Specially  S Q L  Server  I...To Study  E T L ( Extract, Transform, Load) Tools Specially  S Q L  Server  I...
To Study E T L ( Extract, Transform, Load) Tools Specially S Q L Server I...
 
Ods, edf, eav & global types
Ods, edf, eav & global typesOds, edf, eav & global types
Ods, edf, eav & global types
 
SAP HORTONWORKS
SAP HORTONWORKSSAP HORTONWORKS
SAP HORTONWORKS
 
Computer science __engineering(4)
Computer science __engineering(4)Computer science __engineering(4)
Computer science __engineering(4)
 
Periyar msc
Periyar mscPeriyar msc
Periyar msc
 
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
 
Apresentação ODS
Apresentação ODSApresentação ODS
Apresentação ODS
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
 
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data Warehouse Back to Basics: Dimensional Modeling
Data Warehouse Back to Basics: Dimensional ModelingData Warehouse Back to Basics: Dimensional Modeling
Data Warehouse Back to Basics: Dimensional Modeling
 
ETL QA
ETL QAETL QA
ETL QA
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
 
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSAgile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
 

Similar to The Future of Data Warehousing: ETL Will Never be the Same

Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse Optimization
Cloudera, Inc.
 
PayPal Decision Management Architecture
PayPal Decision Management ArchitecturePayPal Decision Management Architecture
PayPal Decision Management Architecture
Pradeep Ballal
 

Similar to The Future of Data Warehousing: ETL Will Never be the Same (20)

The Shifting Landscape of Data Integration
The Shifting Landscape of Data IntegrationThe Shifting Landscape of Data Integration
The Shifting Landscape of Data Integration
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18
 
Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse Optimization
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8
 
Data warehouseold
Data warehouseoldData warehouseold
Data warehouseold
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Achieve data democracy in data lake with data integration
Achieve data democracy in data lake with data integration Achieve data democracy in data lake with data integration
Achieve data democracy in data lake with data integration
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Using Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-PurposeUsing Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-Purpose
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
Big Data/Cloudera from Excelerate Systems
Big Data/Cloudera from Excelerate SystemsBig Data/Cloudera from Excelerate Systems
Big Data/Cloudera from Excelerate Systems
 
Oracle Database Appliance X5-2
Oracle Database Appliance X5-2 Oracle Database Appliance X5-2
Oracle Database Appliance X5-2
 
Chapter 11 Enterprise Resource Planning System
Chapter 11 Enterprise Resource Planning SystemChapter 11 Enterprise Resource Planning System
Chapter 11 Enterprise Resource Planning System
 
PayPal Decision Management Architecture
PayPal Decision Management ArchitecturePayPal Decision Management Architecture
PayPal Decision Management Architecture
 

More from Cloudera, Inc.

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
anilsa9823
 

Recently uploaded (20)

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 

The Future of Data Warehousing: ETL Will Never be the Same

  • 1. 1© Cloudera, Inc. All rights reserved. The Future of Data Warehousing: ETL Will Never be the Same Ralph Kimball| Founder, Kimball Group Manish Vipani| Vice President and Chief Architect, Kaiser Permanente
  • 2. 2© Cloudera, Inc. All rights reserved. Hadoop’s impact on data warehousing • Traditional DBMS stack exploded into separate layers • Data layer: HDFS files, not curated relational tables • Metadata layer: open extensible HCatalog, not vendor system tables • Query layer: cottage industry of query engines, not vendor specific SQL • Schema on Read • Allow the query layer to decide how to consume the data • Materialize the view later (e.g., into Parquet files) for high performance Integration goes far beyond relational tables • Conformed dimensions remain the glue holding together Hadoop applications (even if you have never heard of conformed dimensions!)
  • 3. The logical architecture hasn’t changed
      • Original Sources → ETL Step → Exposed Presentation Data → BI Application
      • BUT, the physical architecture of the back room now looks very different
  • 4. The old and new back rooms
      Old back room
      • Slow transfer from sources
      • Physical transformations required
      • Cleaning, normalization required
      • Mandated RDBMS table targets
      • Metadata limited to system tables
      • Presentation layer vendor mandated
      • Single focus: RDBMS SQL only
      New back room
      • Purpose built for high transfer rates
      • Physical transformations optional
      • Cleaning, normalization discouraged
      • Table targets optional or deferred
      • Extensible metadata via HCatalog
      • Presentation layer open ended
          • Before or after any transformations
          • Analytic client specific
          • Multiple simultaneous personalities
  • 5. The biggest change to the back room
      Old back room
      • Off limits except to ETL staff
          • “we aren’t ready”
          • “the data must be cleaned”
          • “data governance trumps”
          • “end users not trusted”
      • Traditional IT control
      New back room
      • Doors open to
          • Qualified analytic users
          • Automated processes
          • Experiments, model building
          • Clients other than SQL
      • Open data marketplace
  • 6. The Landing Zone at Kaiser Permanente
      Implementing the new ETL approach in the real world.
      A unified data repository for secure and trusted data.
  • 7. Landing Zone
      Landing Zone – Home to secure and organized data
      • A self service data platform hosting both the raw and prepared data sets for quick business consumption to drive advanced business insights and decisions.
      • Allow seamless data access for authorized users across enterprise business functions.
      • Data is organized by domains/use cases in Raw and Refined zone.
      • Perimeter security with data encrypted at rest.
      • Kerberized with integration to Identity and Access Management system.
      Parts of Landing Zone
      • Raw Zone -> Exact replica of source data.
      • Refined Zone -> Transformed prepared data sets organized by use cases.
      • User Defined Space -> Secure and common access to raw and trusted data.
      • Master Data, Metadata, Internal Reference Data, Industry Reference Data, etc…
  • 8. Landing Zone
      [Diagram: Landing Zone architecture. Labels: Source Data; Replicate, Extract-Load, Copy, Data Load, Data Extract, Data Selection; HDFS holding Raw Zone, Refined Zone, User Defined Space, Master Data, Meta Data, Internal Reference Data, Industry Reference Data, Usage Data; Data Registry (Tags & Catalog); Role Based Access Control; Perimeter Security; Access Authentication; All Data Encrypted @ Rest; SQL, Java, Pig, Hive, Impala, Python; Discover, Refine, Mine, Analyze; Exploratory Intelligence; DW/DM.]
      Landing Zone – A Self Service Data Platform hosting both the raw and prepared data sets for quick business consumption.
      • Data Security –
          • Deployed on secured network with traffic monitoring.
          • Data is encrypted at rest.
          • Role based access and authorization.
      • Data Organization –
          • Exact replica of source data organized by information domains in Raw Zone.
          • Data organized by use cases in the Refined Zone (transformed prepared data sets).
          • Separate area allocated to track master data, metadata, internal reference data & industry specific reference data sets.
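The Raw Zone / Refined Zone split and the role-based access described on the two Landing Zone slides can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the class `LandingZone`, its methods, the role names, and the sample records are hypothetical, and a real deployment would sit on HDFS with Kerberos and encryption at rest rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class LandingZone:
    """Toy model: zone contents are dictionaries instead of HDFS directories."""
    raw: dict = field(default_factory=dict)      # information domain -> exact replica of source rows
    refined: dict = field(default_factory=dict)  # use case -> transformed, prepared data set
    grants: dict = field(default_factory=dict)   # role -> set of zone names the role may read

    def replicate(self, domain, records):
        # Raw Zone: copy the source rows unchanged, no cleaning or normalization.
        self.raw[domain] = [dict(r) for r in records]

    def refine(self, use_case, domain, transform):
        # Refined Zone: derive a prepared data set; the raw copy stays untouched.
        self.refined[use_case] = transform(self.raw[domain])

    def read(self, role, zone, key):
        # Role-based access check before any data is returned.
        if zone not in self.grants.get(role, set()):
            raise PermissionError(f"role {role!r} may not read the {zone} zone")
        return getattr(self, zone)[key]


def logins_per_member(rows):
    """Hypothetical transformation for a usage dashboard."""
    counts = {}
    for row in rows:
        if row["action"] == "login":
            counts[row["member_id"]] = counts.get(row["member_id"], 0) + 1
    return counts


zone = LandingZone(grants={"data_scientist": {"raw", "refined"}, "dashboard": {"refined"}})
zone.replicate("web_portal", [
    {"member_id": "m001", "action": "login"},
    {"member_id": "m002", "action": "refill"},
    {"member_id": "m001", "action": "login"},
])
zone.refine("portal_usage", "web_portal", logins_per_member)
```

In this sketch a dashboard role can read the prepared `portal_usage` data set but is refused the raw replica, while a data scientist role can read both, which mirrors the "qualified analytic users" idea from the earlier slides.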
  • 9. The ETL Revolution Poses Significant Challenges
      Some old, some new
  • 10. Old challenges we’ve seen before
      • Big data world is furiously implementing stovepipes
          • Good news is the excitement of new data sources and analyses
          • Bad news is ignoring integration, the fix is to start over
      • New departments not seen with traditional data warehousing
          • Not on anyone’s radar → rolling their own systems
          • Unusual business user profiles, latency demands, security lapses
      • Big speed bumps when replacing old systems with new
          • Users don’t want to switch
          • New results don’t match old results
          • Legacy hardware and software absurdly expensive, doesn’t scale reasonably
  • 11. New challenges needing inventive approaches
      • Traditional BI decision makers joined by
          • Data scientists
              • Roll their own ETL, hardware, OSs, programming languages
              • Take results to senior management directly
              • Don’t stick around for documentation, rollout, user support, maintenance
          • Predictive models and modelers
              • Constantly changing schemas
              • Tricky integration, e.g., joining relational tables to HBase
          • Automatic daemons
              • Enormous, bursty demand for computing resources
  • 12. Kaiser Permanente’s Pragmatic Response to the Challenges
      Pain Points:
      • Lack of user transient store and structural flexibility due to slow adaptation to changes.
      • Lack of ability to do analytics and hypothesis testing of new data from disparate systems.
      Successes:
      • More than 10 proven use cases with some early adopters.
  • 13. Landing Zone use cases
      Online Member Services – “kp.org”
      Problem
      • Lack of insight into the factors influencing members’ adoption and utilization of online services.
      • Lack of data integration and correlation due to disparate systems.
      • Lack of a 360° member service utilization view and dashboards.
      Resolution
      • Summarized and aggregated data sets in the Landing Zone help improve decision making.
      • Faster and complete access to data at scale for metrics reporting and analytics.
      • Reduced data collection & metric reporting time from 3 weeks to 10 hours.
      • Ease of building “decision-centric” dashboards (8 in 3 months).
  • 14. Landing Zone use cases cont…
      Moving Traditional Data Warehouse Workload to Landing Zone
      Problem
      • Commercial large-scale data warehouse (Teradata) repository is expensive at scale, grows exponentially, and processes large volumes of queries/month.
      • Continuing workload tuning efforts are slow to yield expected results.
      Resolution
      • Replicate data from Teradata into Landing Zone.
      • Rewrite and tune queries to eliminate semantically equivalent queries to achieve better performance.
      Digital Services Dashboard – “Interchange”
      Problem
      • Lack of platform to collect and correlate structured and unstructured data from consumer facing health monitoring devices, e.g., Fitbit, Glucometer, etc.
      • Clinicians cannot track members’ health or weight goals, and see usage patterns.
      Resolution
      • Ingest transactional data and device logs into landing zone and create analytics workspace.
      • Enable clinicians to generate aggregated data for tracking member adherence and build dashboards using native tools.
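The slide does not say how "semantically equivalent queries" were identified in the Teradata workload. As a rough sketch of one possible first pass, the code below groups queries that differ only in keyword case, whitespace, or a trailing semicolon, while leaving quoted string literals untouched. The function names and sample queries are hypothetical, and this catches only textual variants; recognizing true semantic equivalence (reordered predicates, rewritten joins) would need a SQL parser or the optimizer's own plans.

```python
import re
from collections import defaultdict

# Matches single-quoted SQL string literals, including doubled '' escapes.
_LITERAL = re.compile(r"('(?:[^']|'')*')")


def normalize(sql):
    """Lowercase and collapse whitespace outside string literals."""
    parts = _LITERAL.split(sql.strip().rstrip(";"))
    # With a capturing group, even indexes hold the text between literals.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"\s+", " ", parts[i]).lower()
    return "".join(parts).strip()


def group_equivalent(queries):
    """Return groups of two or more queries that share a normalized form."""
    groups = defaultdict(list)
    for query in queries:
        groups[normalize(query)].append(query)
    return [group for group in groups.values() if len(group) > 1]


q1 = "SELECT region, COUNT(*) FROM visits WHERE status = 'Open' GROUP BY region;"
q2 = "select region,  count(*)\nfrom visits where status = 'Open' group by region"
q3 = "select region, count(*) from visits where status = 'open' group by region"

duplicates = group_equivalent([q1, q2, q3])
```

The first two queries fall into one group, and the third stays separate because its literal `'open'` differs in case and could match different rows.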
  • 15. Landing Zone use cases cont…
      Common Clinical and Analytical Views
      Problem
      • Sequential and fragmented processes have limited ability to enrich data sources to increase accuracy.
      • Lack of clinical and analytical views increases lead time to analysis and produces inconsistent results.
      Resolution
      • Ingest data from fragmented systems into the Landing Zone.
      • Created program-wide clinical and analytical views, cutting refresh time from 18 hours to 7 hours.
      Consumer Information Management Platform – CIMP 2.0
      Problem
      • Current Medicare reporting solution does not maintain history and requires significant effort to recreate prior reports and perform trend analysis.
      • Externally hosted CIMP systems are cost-prohibitive and difficult to scale.
      Resolution
      • Replicate data from 30+ source systems into Landing Zone providing access to data internally.
      • Rebuild reports with improved performance that run within a reasonable time at scale.
      • Proved versatility of platform to handle data at scale and created equivalent reports.
  • 16. Architectural Wrap-Up
      What does all this mean?
  • 17. Kaiser Permanente is a work in progress with impressive early results, and insights for moving forward
      • Be the single source of all Kaiser’s data as well as external data leveraged by Kaiser applications, processes, and for Kaiser decision making.
      • “Learn and adapt” model provides common capabilities across rich data set, with increased agility in provisioning new data sets.
      • Enabling data profiling / tagging, semantic search, descriptive, predictive and prescriptive analytics to drive advanced business insights and decisions.
  • 18. The Back Room Landing Zone has become a Vibrant Marketplace
      • Replaces the quiet ETL back room
      • Challenging (exciting) new service role for IT
      • Open for business
      • Data scientists → A/B testing → experimentation → prototyping
      • Simultaneous ETL pipelines → aggregates, high-performance Parquet files, uploads to EDW
      • Simultaneous SQL and non-SQL clients
      • Immediate access
      • Don’t wait for physical transformation → schema-on-read
      • Purpose built for extreme I/O performance
  • 19. Thank you
      Ralph Kimball, ralphcollector@gmail.com
      Manish Vipani, manish.x.vipani@kp.org