1. © Cloudera, Inc. All rights reserved.
The Future of Data Warehousing: ETL Will Never be the Same
Ralph Kimball | Founder, Kimball Group
Manish Vipani | Vice President and Chief Architect, Kaiser Permanente
2.
Hadoop’s impact on data warehousing
• Traditional DBMS stack exploded into separate layers
  • Data layer: HDFS files, not curated relational tables
  • Metadata layer: open extensible HCatalog, not vendor system tables
  • Query layer: cottage industry of query engines, not vendor specific SQL
• Schema on Read
  • Allow the query layer to decide how to consume the data
  • Materialize the view later (e.g., into Parquet files) for high performance
• Integration goes far beyond relational tables
  • Conformed dimensions remain the glue holding together Hadoop applications (even if you have never heard of conformed dimensions!)
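The schema-on-read idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pipe-delimited sample text stands in for a raw HDFS file, the field names and the two schemas are hypothetical, and the "materialized view" is an in-memory dictionary rather than a Parquet file.

```python
import csv
import io

# Raw data lands untouched; no schema is enforced at write time.
# Hypothetical sample layout: member_id|visit_date|charge
RAW_LINES = """1001|2015-03-01|120.50
1002|2015-03-01|80.00
1001|2015-03-02|45.25
"""

def read_with_schema(raw_text, schema):
    """Apply a schema (list of (name, cast) pairs) only at read time."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="|")
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

# Two consumers read the same raw bytes with different schemas.
billing_schema = [("member_id", int), ("visit_date", str), ("charge", float)]
audit_schema = [("member_id", str), ("visit_date", str), ("charge", str)]

billing_rows = list(read_with_schema(RAW_LINES, billing_schema))
audit_rows = list(read_with_schema(RAW_LINES, audit_schema))

# "Materialize the view later": compute a typed result once and keep it,
# so repeated queries do not have to re-parse the raw text.
charges_by_member = {}
for row in billing_rows:
    member = row["member_id"]
    charges_by_member[member] = charges_by_member.get(member, 0.0) + row["charge"]
```

The point of the sketch is that the raw text is never rewritten: each consumer decides how to interpret it, and a typed, aggregated copy is produced only when performance calls for it.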
3.
The logical architecture hasn’t changed
• Original Sources → ETL Step → Exposed Presentation Data → BI Application
• BUT, the physical architecture of the back room now looks very different
4.
The old and new back rooms
Old back room
• Slow transfer from sources
• Physical transformations required
• Cleaning, normalization required
• Mandated RDBMS table targets
• Metadata limited to system tables
• Presentation layer vendor mandated
• Single focus: RDBMS SQL only
New back room
• Purpose built for high transfer rates
• Physical transformations optional
• Cleaning, normalization discouraged
• Table targets optional or deferred
• Extensible metadata via HCatalog
• Presentation layer open ended
  • Before or after any transformations
  • Analytic client specific
  • Multiple simultaneous personalities
5.
The biggest change to the back room
Old back room
• Off limits except to ETL staff
  • “we aren’t ready”
  • “the data must be cleaned”
  • “data governance trumps”
  • “end users not trusted”
• Traditional IT control
New back room
• Doors open to
  • Qualified analytic users
  • Automated processes
  • Experiments, model building
  • Clients other than SQL
• Open data marketplace
6.
The Landing Zone at Kaiser Permanente
Implementing the new ETL approach in the real world.
A unified data repository for secure and trusted data.
7.
Landing Zone
Landing Zone – Home to secure and organized data
• A self-service data platform hosting both the raw and prepared data sets for quick business consumption to drive advanced business insights and decisions.
• Allows seamless data access for authorized users across enterprise business functions.
• Data is organized by domains/use cases in the Raw and Refined Zones.
• Perimeter security with data encrypted at rest.
• Kerberized, with integration into the Identity and Access Management system.
Parts of Landing Zone
• Raw Zone -> Exact replica of source data.
• Refined Zone -> Transformed prepared data sets organized by use cases.
• User Defined Space -> Secure and common access to raw and trusted data.
• Master Data, Metadata, Internal Reference Data, Industry Reference Data, etc…
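One way to picture the zone organization described above is as a path convention. The sketch below is hypothetical throughout: the slides give no actual directory names, so the root path, the zone directory names, and the helper functions are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical root; the slides do not state real paths.
LANDING_ZONE_ROOT = PurePosixPath("/landing_zone")

def raw_path(domain, source_system, table):
    """Raw Zone: exact replica of source data, organized by information domain."""
    return LANDING_ZONE_ROOT / "raw" / domain / source_system / table

def refined_path(use_case, dataset):
    """Refined Zone: transformed, prepared data sets organized by use case."""
    return LANDING_ZONE_ROOT / "refined" / use_case / dataset

def user_space_path(user, name):
    """User Defined Space: a per-user working area (assumed layout)."""
    return LANDING_ZONE_ROOT / "user" / user / name

def zone_of(path):
    """Return the zone directory name a path belongs to."""
    return PurePosixPath(path).relative_to(LANDING_ZONE_ROOT).parts[0]
```

For example, `raw_path("membership", "crm", "members")` yields `/landing_zone/raw/membership/crm/members`, and `zone_of` applied to that path returns `raw`. Keeping the zone as the first path component makes it straightforward to attach different access rules per zone.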
8.
Landing Zone
[Figure: Landing Zone architecture diagram. Labels: Source Data; Data Load (Extract-Load, Replicate, Copy); HDFS holding the Raw Zone, Refined Zone, and User Defined Space, plus Master Data, Meta Data, Internal Reference Data, Industry Reference Data, Usage Data, and a Data Registry (Tags & Catalog); access through SQL, Java, Pig, Hive, Impala, and Python; Data Selection; Exploratory Intelligence (Analyze, Mine, Refine, Discover); Data Extract to DW/DM; Perimeter Security, Access Authentication, Role Based Access Control; All Data Encrypted @ Rest.]
Landing Zone: A Self Service Data Platform hosting both the raw and prepared data sets for quick business consumption.
Data Security
• Deployed on secured network with traffic monitoring.
• Data is encrypted at rest.
• Role based access and authorization.
Data Organization
• Exact replica of source data organized by information domains in Raw Zone.
• Data organized by use cases in the Refined Zone (transformed prepared data sets).
• Separate area allocated to track master data, metadata, internal reference data & industry specific reference data sets.
9.
The ETL Revolution Poses Significant Challenges
Some old, some new
10.
Old challenges we’ve seen before
• Big data world is furiously implementing stovepipes
  • Good news is the excitement of new data sources and analyses
  • Bad news is ignoring integration; the fix is to start over
• New departments not seen with traditional data warehousing
  • Not on anyone’s radar, rolling their own systems
  • Unusual business user profiles, latency demands, security lapses
• Big speed bumps when replacing old systems with new
  • Users don’t want to switch
  • New results don’t match old results
• Legacy hardware and software are absurdly expensive and don’t scale reasonably
11.
New challenges needing inventive approaches
• Traditional BI decision makers joined by
  • Data scientists
    • Roll their own ETL, hardware, OSs, programming languages
    • Take results to senior management directly
    • Don’t stick around for documentation, rollout, user support, maintenance
  • Predictive models and modelers
    • Constantly changing schemas
    • Tricky integration, e.g., joining relational tables to HBase
  • Automatic daemons
    • Enormous, bursty demand for computing resources
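The "joining relational tables to HBase" challenge mentioned above can be illustrated without any real client library. In this sketch a plain dictionary stands in for an HBase table (row key mapped to column-qualified cells), and the table contents, column names, and the `lookup_join` helper are all hypothetical.

```python
# Relational side: rows as they might come back from a SQL query.
members = [
    {"member_id": "1001", "region": "NCAL"},
    {"member_id": "1002", "region": "SCAL"},
    {"member_id": "1003", "region": "NCAL"},
]

# Key-value side: a dict standing in for an HBase table. Each row key maps
# to a sparse set of "family:qualifier" cells; values are stored as strings.
device_readings = {
    "1001": {"d:steps": "8200", "d:weight_kg": "81.5"},
    "1003": {"d:steps": "4100"},
}

def lookup_join(rows, kv_table, key, columns):
    """Left join: one point lookup in the key-value table per relational row."""
    for row in rows:
        cells = kv_table.get(row[key], {})
        joined = dict(row)
        for column in columns:
            joined[column] = cells.get(column)  # None when the cell is absent
        yield joined

joined_rows = list(lookup_join(members, device_readings, "member_id", ["d:steps"]))
```

The sketch shows why this integration is tricky: the key-value side is sparse and untyped, so the join has to tolerate missing rows and missing cells, and it only works when both sides share the same key, which is the role a conformed dimension plays.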
12.
Kaiser Permanente’s Pragmatic Response to the Challenges
Pain Points:
• Lack of user transient store and structural flexibility due to slow adaptation to changes.
• Lack of ability to do analytics and hypothesis testing of new data from disparate systems.
Successes:
• More than 10 proven use cases with some early adopters.
13.
Landing Zone use cases
Online Member Services: “kp.org”
Problem
• Lack of insight into the factors influencing members’ adoption and utilization of online services.
• Lack of data integration and correlation due to disparate systems.
• Lack of a 360° member service utilization view and dashboards.
Resolution
• Summarized and aggregated data sets in the Landing Zone help improve decision making.
• Faster and complete access to data at scale for metrics reporting and analytics.
• Reduced data collection & metric reporting time from 3 weeks to 10 hours.
• Ease of building “decision-centric” dashboards (8 in 3 months).
14.
Landing Zone use cases cont…
Moving Traditional Data Warehouse Workload to Landing Zone
Problem
• Commercial large-scale data warehouse (Teradata) repository is expensive at scale, grows exponentially, and processes large volumes of queries/month.
• Continuing workload tuning efforts are slow to yield expected results.
Resolution
• Replicate data from Teradata into Landing Zone.
• Rewrite and tune queries, eliminating semantically equivalent ones, to achieve better performance.
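The slides do not say how semantically equivalent queries were identified. As a rough first pass, one could group queries whose text is identical after normalizing case, whitespace, and a trailing semicolon. The sketch below assumes exactly that; it catches only textual variants, not queries that are logically equivalent but written differently, and the `normalize` helper and sample query log are hypothetical.

```python
import re
from collections import defaultdict

def normalize(sql):
    """Lowercase and collapse whitespace outside single-quoted literals."""
    text = sql.strip().rstrip(";").strip()
    # The capturing group keeps the quoted literals in the split result,
    # so their case and spacing are preserved unchanged.
    parts = re.split(r"('[^']*')", text)
    cleaned = []
    for part in parts:
        if part.startswith("'"):
            cleaned.append(part)
        else:
            cleaned.append(re.sub(r"\s+", " ", part).lower())
    return "".join(cleaned)

query_log = [
    "SELECT member_id FROM visits WHERE region = 'NCAL'",
    "select member_id\n  from visits where region = 'NCAL';",
    "SELECT member_id FROM visits WHERE region = 'ncal'",
]

# Group the original query texts by their normalized form.
groups = defaultdict(list)
for query in query_log:
    groups[normalize(query)].append(query)

# Groups with more than one member are candidates for consolidation.
duplicate_groups = [queries for queries in groups.values() if len(queries) > 1]
```

Lowercasing unquoted identifiers assumes a dialect that treats them case-insensitively. The third query stays in its own group because its string literal differs, which is the conservative choice for a tool that only proposes candidates for review.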
Digital Services Dashboard: “Interchange”
Problem
• Lack of a platform to collect and correlate structured and unstructured data from consumer-facing health monitoring devices (e.g., Fitbit, glucometer).
• Clinicians cannot track members’ health or weight goals or see usage patterns.
Resolution
• Ingest transactional data and device logs into the Landing Zone and create an analytics workspace.
• Enable clinicians to generate aggregated data for tracking member adherence and build dashboards using native tools.
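The kind of aggregation this use case describes, tracking member adherence from device logs, might look like the following. Everything specific here is an assumption for illustration: the record layout, the sample values, the step goal, and the definition of adherence as the share of logged days on which the goal was met. The sketch also assumes at most one record per member per day.

```python
from collections import defaultdict

# Hypothetical device log records: (member_id, day, steps).
device_log = [
    ("1001", "2015-03-01", 9000),
    ("1001", "2015-03-02", 4000),
    ("1001", "2015-03-03", 11000),
    ("1002", "2015-03-01", 3000),
]

DAILY_STEP_GOAL = 8000  # hypothetical threshold, not from the slides

def adherence_by_member(records, goal):
    """Fraction of logged days on which each member met the goal."""
    days_logged = defaultdict(int)
    days_met = defaultdict(int)
    for member_id, _day, steps in records:
        days_logged[member_id] += 1
        if steps >= goal:
            days_met[member_id] += 1
    return {member: days_met[member] / days_logged[member] for member in days_logged}

adherence = adherence_by_member(device_log, DAILY_STEP_GOAL)
```

A dashboard would typically read a pre-aggregated result like `adherence` rather than rescanning the raw device logs on every request.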
15.
Landing Zone use cases cont…
Common Clinical and Analytical Views
Problem
• Sequential and fragmented processes have limited ability to enrich data sources to increase accuracy.
• Lack of clinical and analytical views increases lead time to analysis and produces inconsistent results.
Resolution
• Ingest data from fragmented systems into the Landing Zone.
• Created program-wide clinical and analytical views, with refresh time reduced from 18 hours to 7 hours.
Consumer Information Management Platform: CIMP 2.0
Problem
• Current Medicare reporting solution does not maintain history and requires significant effort to recreate prior reports and perform trend analysis.
• Externally hosted CIMP systems are cost-prohibitive and difficult to scale.
Resolution
• Replicate data from 30+ source systems into the Landing Zone, providing access to data internally.
• Rebuild reports with improved performance that run within a reasonable time at scale.
• Proved versatility of platform to handle data at scale and created equivalent reports.
16.
Architectural Wrap-Up
What does all this mean?
17.
Kaiser Permanente is a work in progress with impressive early results, and insights for moving forward
• Be the single source of all Kaiser’s data as well as external data leveraged by Kaiser applications and processes and for Kaiser decision making.
• “Learn and adapt” model provides common capabilities across a rich data set, with increased agility in provisioning new data sets.
• Enabling data profiling / tagging, semantic search, and descriptive, predictive, and prescriptive analytics to drive advanced business insights and decisions.
18.
The Back Room Landing Zone has become a Vibrant Marketplace
• Replaces the quiet ETL back room
• Challenging (exciting) new service role for IT
• Open for business
  • Data scientists: A/B testing, experimentation, prototyping
  • Simultaneous ETL pipelines: aggregates, high-performance Parquet files, uploads to EDW
  • Simultaneous SQL and non-SQL clients
• Immediate access
  • Don’t wait for physical transformation: schema-on-read
  • Purpose built for extreme I/O performance
19.
Thank you
Ralph Kimball, ralphcollector@gmail.com
Manish Vipani, manish.x.vipani@kp.org