1© Cloudera, Inc. All rights reserved.
The Future of Data Warehousing: ETL Will Never be the Same
Ralph Kimball | Founder, Kimball Group
Manish Vipani | Vice President and Chief Architect, Kaiser Permanente
Hadoop’s impact on data warehousing
• Traditional DBMS stack exploded into separate layers
  • Data layer: HDFS files, not curated relational tables
  • Metadata layer: open extensible HCatalog, not vendor system tables
  • Query layer: cottage industry of query engines, not vendor-specific SQL
• Schema on read
  • Allow the query layer to decide how to consume the data
  • Materialize the view later (e.g., into Parquet files) for high performance
• Integration goes far beyond relational tables
  • Conformed dimensions remain the glue holding together Hadoop applications (even if you have never heard of conformed dimensions!)
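The schema-on-read idea above can be illustrated with a minimal sketch. It assumes a small comma-separated event log; the field names, sample rows, and the `read_with_schema` helper are all hypothetical and are not taken from the presentation. The raw text is stored untouched, each consumer applies its own schema when reading, and a useful view is computed and kept only later.

```python
import csv
import io

# Raw data landed as-is: no schema was enforced at write time.
# (Sample rows are invented for illustration.)
RAW_LINES = """\
2015-03-01,m001,login,web
2015-03-01,m002,refill,mobile
2015-03-02,m001,refill,web
"""


def read_with_schema(raw_text, schema):
    """Apply a schema (list of (name, converter) pairs) at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows


# Two consumers read the same bytes with different schemas.
EVENT_SCHEMA = [("day", str), ("member_id", str),
                ("action", str), ("channel", str)]
CHANNEL_SCHEMA = [("day", str), ("member_id", str),
                  ("_unused", str), ("channel", str.upper)]

events = read_with_schema(RAW_LINES, EVENT_SCHEMA)
channels = read_with_schema(RAW_LINES, CHANNEL_SCHEMA)

# "Materialize later": once a view proves useful, compute it once and keep it.
refills_per_member = {}
for row in events:
    if row["action"] == "refill":
        member = row["member_id"]
        refills_per_member[member] = refills_per_member.get(member, 0) + 1
```

In a Hadoop setting the materialized view would typically be written to a columnar format such as Parquet; the in-memory dictionary here only stands in for that step.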
The logical architecture hasn’t changed
• Original Sources → ETL Step → Exposed Presentation Data → BI Application
• BUT, the physical architecture of the back room now looks very different
The old and new back rooms
Old back room
• Slow transfer from sources
• Physical transformations required
• Cleaning, normalization required
• Mandated RDBMS table targets
• Metadata limited to system tables
• Presentation layer vendor mandated
• Single focus: RDBMS SQL only
New back room
• Purpose built for high transfer rates
• Physical transformations optional
• Cleaning, normalization discouraged
• Table targets optional or deferred
• Extensible metadata via HCatalog
• Presentation layer open ended
  • Before or after any transformations
  • Analytic client specific
• Multiple simultaneous personalities
The biggest change to the back room
Old back room
• Off limits except to ETL staff
  • “we aren’t ready”
  • “the data must be cleaned”
  • “data governance trumps”
  • “end users not trusted”
• Traditional IT control
New back room
• Doors open to
  • Qualified analytic users
  • Automated processes
  • Experiments, model building
  • Clients other than SQL
• Open data marketplace
The Landing Zone at Kaiser Permanente
Implementing the new ETL approach in the real world.
A unified data repository for secure and trusted data.
Landing Zone
Landing Zone – Home to secure and organized data
• A self-service data platform hosting both raw and prepared data sets for quick business consumption, to drive advanced business insights and decisions.
• Allows seamless data access for authorized users across enterprise business functions.
• Data is organized by domains/use cases in the Raw and Refined zones.
• Perimeter security with data encrypted at rest.
• Kerberized, with integration with the Identity and Access Management system.
Parts of Landing Zone
• Raw Zone -> Exact replica of source data.
• Refined Zone -> Transformed prepared data sets organized by use cases.
• User Defined Space -> Secure and common access to raw and trusted data.
• Master Data, Metadata, Internal Reference Data, Industry Reference Data, etc…
Landing Zone
[Figure: Landing Zone architecture diagram. Labels: Source Data; Data Load; Extract-Load; Copy; Replicate; HDFS; Raw Zone; Refined Zone; User Defined Space; Master Data; Meta Data; Internal Reference Data; Industry Reference Data; Usage Data; Data Registry (Tags & Catalog); Data Selection; Exploratory Intelligence (Analyze, Mine, Refine, Discover); SQL, Java, Pig, Hive, Python, Impala; Data Extract; DW/DM; Perimeter Security; Access Authentication; Role Based Access Control; All Data Encrypted @ Rest.]
Landing Zone – A self-service data platform hosting both raw and prepared data sets for quick business consumption.
• Data Security
  • Deployed on a secured network with traffic monitoring.
  • Data is encrypted at rest.
  • Role-based access and authorization.
• Data Organization
  • Exact replica of source data, organized by information domains, in the Raw Zone.
  • Data organized by use cases in the Refined Zone (transformed, prepared data sets).
  • Separate area allocated to track master data, metadata, internal reference data, and industry-specific reference data sets.
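The zone layout and role-based access described above can be sketched roughly as follows. This is a toy in-memory model under stated assumptions, not Kaiser Permanente's implementation: the `LandingZone` class, the role names, and the zone identifiers are hypothetical, and a real deployment would rely on HDFS permissions, Kerberos, and an authorization service rather than Python dictionaries.

```python
from dataclasses import dataclass, field

# Zone names follow the slide; the identifiers themselves are assumptions.
ZONES = ("raw", "refined", "user_defined")


@dataclass
class LandingZone:
    # role -> set of (zone, action) pairs the role may perform
    grants: dict = field(default_factory=dict)
    # zone -> {dataset name: rows}
    data: dict = field(default_factory=lambda: {z: {} for z in ZONES})

    def grant(self, role, zone, action):
        if zone not in ZONES:
            raise ValueError(f"unknown zone: {zone}")
        self.grants.setdefault(role, set()).add((zone, action))

    def _check(self, role, zone, action):
        if (zone, action) not in self.grants.get(role, set()):
            raise PermissionError(f"{role} may not {action} in {zone}")

    def write(self, role, zone, name, rows):
        self._check(role, zone, "write")
        self.data[zone][name] = list(rows)

    def read(self, role, zone, name):
        self._check(role, zone, "read")
        return list(self.data[zone][name])


lz = LandingZone()
lz.grant("ingest", "raw", "write")      # loaders replicate source data
lz.grant("analyst", "raw", "read")      # analysts may read raw data
lz.grant("analyst", "user_defined", "write")

lz.write("ingest", "raw", "claims", [{"id": 1}, {"id": 2}])
raw_claims = lz.read("analyst", "raw", "claims")

try:
    lz.write("analyst", "raw", "claims", [])   # analysts cannot alter raw data
    analyst_write_blocked = False
except PermissionError:
    analyst_write_blocked = True
```

Keeping the Raw Zone write-protected from analytic roles is one way to preserve it as an exact replica of the source while still opening it for reads.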
The ETL Revolution Poses Significant Challenges
Some old, some new
Old challenges we’ve seen before
• Big data world is furiously implementing stovepipes
  • Good news is the excitement of new data sources and analyses
  • Bad news is ignoring integration; the fix is to start over
• New departments not seen with traditional data warehousing
  • Not on anyone’s radar → rolling their own systems
  • Unusual business user profiles, latency demands, security lapses
• Big speed bumps when replacing old systems with new
  • Users don’t want to switch
  • New results don’t match old results
• Legacy hardware and software absurdly expensive, doesn’t scale reasonably
New challenges needing inventive approaches
• Traditional BI decision makers joined by
  • Data scientists
    • Roll their own ETL, hardware, OSs, programming languages
    • Take results to senior management directly
    • Don’t stick around for documentation, rollout, user support, maintenance
  • Predictive models and modelers
    • Constantly changing schemas
    • Tricky integration, e.g., joining relational tables to HBase
  • Automatic daemons
    • Enormous, bursty demand for computing resources
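To make the "tricky integration" point concrete, here is a hedged sketch of joining relational rows to a key-value store. A plain Python dictionary stands in for row-key lookups in a wide-column store such as HBase; no HBase client API is used, and the tables, column names, and values are invented for illustration.

```python
# Relational side: rows as they might come back from a SQL query.
encounters = [
    {"member_id": "m001", "visit_date": "2015-03-01"},
    {"member_id": "m002", "visit_date": "2015-03-02"},
    {"member_id": "m003", "visit_date": "2015-03-02"},
]

# Key-value side: a dict standing in for row-key lookups in a wide-column
# store. Rows are sparse, so columns may be missing for some keys.
device_profiles = {
    "m001": {"device": "glucometer", "readings": 42},
    "m002": {"device": "fitness tracker"},  # no "readings" column
}


def lookup_join(rows, store, key, columns):
    """Left join rows to a key-value store, filling missing columns with None."""
    joined = []
    for row in rows:
        profile = store.get(row[key], {})
        joined.append({**row, **{c: profile.get(c) for c in columns}})
    return joined


result = lookup_join(encounters, device_profiles, "member_id",
                     ["device", "readings"])
```

The awkward part the slide alludes to shows up even in this toy: the key-value side has no fixed column list, so the join has to decide which columns to project and what to do when a key or a column is absent.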
Kaiser Permanente’s Pragmatic Response to the Challenges
Pain Points:
• Lack of a transient user store and of structural flexibility, due to slow adaptation to change.
• Lack of ability to do analytics and hypothesis testing on new data from disparate systems.
Successes:
• 10+ proven use cases with some early adopters.
Landing Zone use cases
Online Member Services – “kp.org”
Problem
• Lack of insight into the factors influencing members’ adoption and utilization of online services.
• Lack of data integration and correlation due to disparate systems.
• Lack of a 360° member service utilization view and dashboards.
Resolution
• Summarized and aggregated data sets in the Landing Zone help improve decision making.
• Faster and more complete access to data at scale for metrics reporting and analytics.
• Reduced data collection and metric reporting time from 3 weeks to 10 hours.
• Ease of building “decision-centric” dashboards (8 in 3 months).
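One way to picture how data from disparate systems can be correlated is the conformed-dimension idea the deck opens with. The sketch below is an illustration under stated assumptions, not a description of the kp.org solution: the member dimension, the two fact sets, and all values are invented. Each source is aggregated separately by an attribute of a shared dimension, and the results are then lined up on the common row headers.

```python
from collections import defaultdict

# Conformed dimension: one shared definition of "member" used by every source.
member_dim = {
    "m001": {"region": "North"},
    "m002": {"region": "South"},
    "m003": {"region": "North"},
}

# Two independent fact sets from (hypothetical) separate systems:
# (member_id, measure) pairs.
web_logins = [("m001", 5), ("m002", 2), ("m003", 1)]
rx_refills = [("m001", 1), ("m003", 4)]


def summarize_by(facts, dim, attribute):
    """Aggregate a measure by an attribute of the conformed dimension."""
    totals = defaultdict(int)
    for member_id, measure in facts:
        totals[dim[member_id][attribute]] += measure
    return dict(totals)


def drill_across(*summaries):
    """Line up separately aggregated results on their shared row headers."""
    keys = sorted(set().union(*summaries))
    return {k: tuple(s.get(k, 0) for s in summaries) for k in keys}


report = drill_across(
    summarize_by(web_logins, member_dim, "region"),
    summarize_by(rx_refills, member_dim, "region"),
)
```

Because both summaries use the same `region` values from the same dimension, they can be combined without any source-to-source join.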
Landing Zone use cases (cont.)
Moving Traditional Data Warehouse Workload to Landing Zone
Problem
• The commercial large-scale data warehouse (Teradata) repository is expensive at scale, grows exponentially, and processes large volumes of queries per month.
• Continuing workload tuning efforts are slow to yield expected results.
Resolution
• Replicate data from Teradata into the Landing Zone.
• Rewrite and tune queries, eliminating semantically equivalent queries, to achieve better performance.
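As a rough illustration of finding redundant queries in a workload, the sketch below groups queries that differ only in comments, letter case, whitespace, or a trailing semicolon. This is a deliberately simplified stand-in: it detects textual near-duplicates only, not true semantic equivalence, and nothing here reflects the tooling actually used. The sample queries and function names are hypothetical.

```python
import re
from collections import defaultdict


def normalize(sql):
    """Canonicalize surface differences only.

    Note: lowercasing also affects string literals, so matches should be
    treated as candidates for review, not as proven equivalents.
    """
    sql = re.sub(r"--[^\n]*", " ", sql)           # drop line comments
    sql = re.sub(r"\s+", " ", sql).strip()        # collapse whitespace
    return sql.rstrip(";").strip().lower()


def group_duplicates(queries):
    """Return groups of queries that share the same normalized text."""
    groups = defaultdict(list)
    for query in queries:
        groups[normalize(query)].append(query)
    return [qs for qs in groups.values() if len(qs) > 1]


workload = [
    "SELECT member_id, COUNT(*) FROM visits GROUP BY member_id;",
    "select member_id, count(*)\n  from visits -- nightly report\n  group by member_id",
    "SELECT region FROM members",
]
duplicates = group_duplicates(workload)
```

Catching genuinely equivalent queries that are written differently (reordered predicates, different join syntax) would need a SQL parser or a query planner, which is outside the scope of this sketch.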
Digital Services Dashboard – “Interchange”
Problem
• Lack of a platform to collect and correlate structured and unstructured data from consumer-facing health monitoring devices (e.g., Fitbit, glucometer).
• Clinicians cannot track members’ health or weight goals or see usage patterns.
Resolution
• Ingest transactional data and device logs into the Landing Zone and create an analytics workspace.
• Enable clinicians to generate aggregated data for tracking member adherence and to build dashboards using native tools.
Landing Zone use cases (cont.)
Common Clinical and Analytical Views
Problem
• Sequential, fragmented processes with limited ability to enrich data sources to increase accuracy.
• Lack of clinical and analytical views increases lead time to analysis and produces inconsistent results.
Resolution
• Ingest data from fragmented systems into the Landing Zone.
• Created program-wide clinical and analytical views, cutting refresh time from 18 hours to 7 hours.
Consumer Information Management Platform – CIMP 2.0
Problem
• The current Medicare reporting solution does not maintain history and requires significant effort to recreate prior reports and perform trend analysis.
• Externally hosted CIMP systems are cost-prohibitive and difficult to scale.
Resolution
• Replicate data from 30+ source systems into the Landing Zone, providing internal access to the data.
• Rebuild reports with improved performance so that they run within a reasonable time at scale.
• Proved the versatility of the platform to handle data at scale and created equivalent reports.
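The history problem named above (prior reports are hard to recreate when history is not kept) is commonly addressed by retaining dated snapshots instead of overwriting data. The following is a minimal sketch of that general technique, assuming an append-only, in-memory store; the `SnapshotStore` class and sample rows are hypothetical and are not described in the presentation.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: each load is kept under its load date."""

    def __init__(self):
        self._snapshots = {}  # load date -> list of rows

    def load(self, load_date, rows):
        if load_date in self._snapshots:
            raise ValueError(f"snapshot for {load_date} already exists")
        self._snapshots[load_date] = list(rows)

    def as_of(self, when):
        """Return the most recent snapshot loaded on or before `when`."""
        eligible = [d for d in self._snapshots if d <= when]
        if not eligible:
            raise LookupError(f"no snapshot on or before {when}")
        return self._snapshots[max(eligible)]

    def trend(self, metric):
        """Apply `metric` to every snapshot, in date order."""
        return [(d, metric(self._snapshots[d])) for d in sorted(self._snapshots)]


store = SnapshotStore()
store.load(date(2015, 1, 31), [{"member": "m001"}, {"member": "m002"}])
store.load(date(2015, 2, 28),
           [{"member": "m001"}, {"member": "m002"}, {"member": "m003"}])

# Re-run a "prior report" as of mid-February, and compute a simple trend.
mid_february = store.as_of(date(2015, 2, 15))
enrollment_trend = store.trend(len)
```

With snapshots retained, recreating an earlier report becomes a lookup by date, and trend analysis becomes a pass over the stored snapshots.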
Architectural Wrap-Up
What does all this mean?
Kaiser Permanente is a work in progress with impressive early results, and insights for moving forward
• Be the single source of all Kaiser’s data, as well as of external data leveraged by Kaiser applications and processes and for Kaiser decision making.
• The “learn and adapt” model provides common capabilities across a rich data set, with increased agility in provisioning new data sets.
• Enable data profiling and tagging, semantic search, and descriptive, predictive, and prescriptive analytics to drive advanced business insights and decisions.
The Back Room Landing Zone has become a Vibrant Marketplace
• Replaces the quiet ETL back room
• Challenging (exciting) new service role for IT
• Open for business
  • Data scientists → A/B testing → experimentation → prototyping
  • Simultaneous ETL pipelines → aggregates, high-performance Parquet files, uploads to EDW
  • Simultaneous SQL and non-SQL clients
• Immediate access
  • Don’t wait for physical transformation → schema-on-read
  • Purpose built for extreme I/O performance
Thank you
Ralph Kimball, ralphcollector@gmail.com
Manish Vipani, manish.x.vipani@kp.org
