BEST PRACTICES FOR
THE APACHE HADOOP
DATA WAREHOUSE
EDW 101 FOR HADOOP
PROFESSIONALS
RALPH KIMBALL / ELI COLLINS
MAY 2014
Best Practices for the Hadoop Data Warehouse
© Ralph Kimball, Cloudera, 2014
May 2014
The Enterprise Data Warehouse
Legacy
 More than 30 years, countless successful
installations, billions of dollars
 Fundamental architecture best practices
 Business user driven: simple, fast, relevant
 Best designs driven by actual data, not top down
models
 Enterprise entities: dimensions, facts, and primary
keys
 Time variance: slowly changing dimensions
 Integration: conformed dimensions
 These best practices also apply to Hadoop
systems
Expose the Data as
Dimensions and Facts
 Dimensions are the enterprise’s fundamental
entities
 Dimensions are a strategic asset
separate from any given data source
 Dimensions need to be attached to each source
 Measurement EVENTS are 1-to-1 with
Fact Table RECORDS
 The GRAIN of a fact table is the physical
world’s description of the measurement event
A Health Care Use Case
 Grain = Health Care Hospital
Events
Grain = Patient Event During Hospital Stay
Importing Raw Data into Hadoop
 Ingesting and transforming raw data from diverse
sources for analysis is where Hadoop shines
 What: Medical device data, doctors’ notes, nurses’ notes,
medications administered, procedures performed,
diagnoses, lab tests, X-rays, ultrasound exams, therapists’
reports, billing, ...
 From: Operational RDBMSs, enterprise data warehouse,
human entered logs, machine generated data files, special
systems, ...
 Use native ingest tools & 3rd party data integration
products
 Always retain original data in full fidelity
 Keep data files “as is” or use Hadoop native formats
 Opportunistically add data sources → Agile!
Importing Raw Data into Hadoop
 First step: get hospital procedures from billing
RDBMS, doctors’ notes from RDBMS, patient info
from DW, ...
 As well as X-rays from radiology system
$ sqoop import \
    --connect jdbc:oracle:thin:@db.server.com/BILLING \
    --table PROCEDURES \
    --target-dir /ingest/procedures/2014_05_29
$ hadoop fs -put /dcom_files/2014_05_29 \
    hdfs://server.com/ingest/xrays/2014_05_29
$ sqoop import … /EMR … --table CLINICAL_NOTES
$ sqoop import … /CDR … --table PATIENT_INFO
Plan the Fact Table
 Second step: explore raw data immediately
before committing to physical data
transformations
> CREATE EXTERNAL TABLE procedures_raw(
date_key bigint,
event timestamp, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/demo/procedures';
 Third step: create queries on raw data that will be the
basis for extracts from each source at the correct
grain
Building the Fact Table
 Fourth step: Build up “native” table for facts using
special logic from extract queries created in step 3:
> CREATE TABLE hospital_events(…)
PARTITIONED BY (date_key bigint) STORED AS PARQUET;
> INSERT INTO TABLE hospital_events
SELECT <special logic> FROM procedures_raw;
… SELECT <special logic> FROM patient_monitor_raw;
… SELECT <special logic> FROM clinical_notes_raw;
… SELECT <special logic> FROM device_17_raw;
… SELECT <special logic> FROM radiology_reports_raw;
… SELECT <special logic> FROM meds_administered_raw;
… and more
The Patient Dimension
 Primary key is a
“surrogate key”
 Durable identifier is
original “natural key”
 50 attributes typical
 Dimension is
instrumented for
episodic (slow)
changes
Manage Your Primary Keys
 “Natural” keys from source (often “un-natural”!)
 Poorly administered, overwritten, duplicated
 Awkward formats, implied semantic content
 Profoundly incompatible across data sources
 Replace or remap natural keys
 Enterprise dimension keys are surrogate keys
 Replace or remap in all dimension and fact tables
 Attach high value enterprise dimensions to every
source just by replacing the original natural keys
Inserting Surrogate Keys in
Facts
 Re-write fact tables with dimension SKs
[Diagram: Original facts (NK) are joined to the SK-NK mapping tables
and inserted into the Target Fact Table (SK). Append deltas to facts
and mapping tables.]
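The remapping described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the presenters' implementation: the function names, the sample natural keys, and the choice of sequential integers as surrogate keys are all assumptions made for the example. In Hadoop the same logic would be a join against the mapping table in Impala or Hive.

```python
from typing import Dict, Iterable, List, Tuple


def assign_surrogate_keys(mapping: Dict[str, int],
                          natural_keys: Iterable[str]) -> Dict[str, int]:
    """Append NK -> SK pairs for unseen natural keys; existing pairs never change."""
    next_sk = max(mapping.values(), default=0) + 1
    for nk in natural_keys:
        if nk not in mapping:
            mapping[nk] = next_sk
            next_sk += 1
    return mapping


def rewrite_facts(facts: List[Tuple[str, float]],
                  mapping: Dict[str, int]) -> List[Tuple[int, float]]:
    """Join each fact row to the mapping table and emit it keyed by SK."""
    return [(mapping[nk], measure) for nk, measure in facts]


# Hypothetical data: one patient already mapped, one arriving as a delta.
mapping = {"MRN-0007": 1}
original_facts = [("MRN-0007", 98.6), ("ssn:123", 101.2), ("MRN-0007", 99.1)]

assign_surrogate_keys(mapping, (nk for nk, _ in original_facts))
target_facts = rewrite_facts(original_facts, mapping)
```

Because the mapping table is append-only, re-running the rewrite on later deltas keeps every previously issued surrogate key stable.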
Track Time Variance
 Dimensional entities change slowly and
episodically
 EDW has responsibility to correctly represent
history
 Must provide for multiple historically time
stamped versions of all dimension members
 SCDs: Slowly Changing Dimensions
 SCD Type 1: Overwrite dimension member, lose
history
 SCD Type 2: Add new time stamped dimension
member record, track history
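The difference between the two SCD types can be made concrete with a small Python sketch. The row layout (`effective_from`, `effective_to`, with `None` marking the current version), the field names, and the sample patient are assumptions for illustration only; the deck does not prescribe a specific column set.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class PatientRow:
    patient_sk: int               # surrogate key, unique per version
    patient_nk: str               # durable natural key
    state: str                    # a tracked attribute
    effective_from: str
    effective_to: Optional[str]   # None marks the current version


def apply_scd1(rows: List[PatientRow], nk: str, new_state: str) -> List[PatientRow]:
    """Type 1: overwrite the attribute in place; history is lost."""
    return [replace(r, state=new_state) if r.patient_nk == nk else r for r in rows]


def apply_scd2(rows: List[PatientRow], nk: str, new_state: str,
               change_time: str) -> List[PatientRow]:
    """Type 2: close the current version and add a new time-stamped row."""
    next_sk = max((r.patient_sk for r in rows), default=0) + 1
    out, changed = [], False
    for r in rows:
        if r.patient_nk == nk and r.effective_to is None and r.state != new_state:
            out.append(replace(r, effective_to=change_time))
            changed = True
        else:
            out.append(r)
    if changed:
        out.append(PatientRow(next_sk, nk, new_state, change_time, None))
    return out


dim = [PatientRow(1, "P-100", "CA", "2014-01-01 00:00:00", None)]
type1 = apply_scd1(dim, "P-100", "OR")
type2 = apply_scd2(dim, "P-100", "OR", "2014-05-29 01:01:01")
```

After the Type 2 change the dimension holds two rows for the same natural key, each with its own surrogate key, so facts recorded before the change still join to the old version.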
Options for Implementing SCD 2
 Re-import the dimension table each time
 Or, import and merge the delta
 Or, re-build the table in Hadoop
 Implement complex merges with an integrated
ETL tool, or in SQL via Impala or Hive
$ sqoop import \
    --table patient_info \
    --incremental lastmodified \
    --check-column SCD2_EFFECTIVE_DATETIME \
    --last-value "2014-05-29 01:01:01"
Integrate Data Sources at the BI
Layer
 If the dimensions of two sources are not
“conformed” then the sources cannot be
integrated
 Two dimensions are conformed if they share
attributes (fields) that have the same domains
and same content
 The integration payload:
Conforming Dimensions in
Hadoop
 Goal: combine diverse data sets in a single
analysis
 Conform operational and analytical schemas
via key dimensions (user, product, geo)
 Build and use mapping tables (à la SK handling)
> CREATE TABLE patient_tmp LIKE patient_dim;
> ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
> INSERT INTO TABLE patient_tmp (SELECT … );
> DROP TABLE patient_dim;
> ALTER TABLE patient_tmp RENAME TO patient_dim;
tedious!
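What conforming buys can be shown with a small Python sketch: two sources label the same attribute differently, a mapping table adds a shared `state_conf` value to each, and the two result sets can then be lined up row by row. The source names, field names, labels, and integer codes below are hypothetical; only the `state_conf` column name comes from the slide.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical mapping table: source-specific labels -> one conformed code.
STATE_CONF: Dict[str, int] = {
    "CA": 6, "Calif.": 6, "california": 6,
    "OR": 41, "Oregon": 41,
}


def conform(rows: List[dict], state_field: str) -> List[dict]:
    """Return copies of the rows with a conformed state_conf attribute added."""
    return [dict(r, state_conf=STATE_CONF[r[state_field]]) for r in rows]


def count_by_state(rows: List[dict]) -> Counter:
    """Aggregate one source on the conformed attribute."""
    return Counter(r["state_conf"] for r in rows)


billing = conform(
    [{"patient": "a", "st": "CA"}, {"patient": "b", "st": "OR"}], "st")
clinical = conform(
    [{"patient": "a", "state_name": "california"},
     {"patient": "c", "state_name": "Calif."}], "state_name")

billing_counts = count_by_state(billing)
clinical_counts = count_by_state(clinical)

# Merge the separately aggregated results on the conformed value.
report = {
    code: (billing_counts.get(code, 0), clinical_counts.get(code, 0))
    for code in sorted(set(billing_counts) | set(clinical_counts))
}
```

Each source is aggregated on its own and the results are merged only on the conformed value, which is why the attribute must have the same domain and content in both.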
Integrate Data Sources at the BI
Layer
 Traditional data warehouse personas
 Dimension manager – responsible for defining and
publishing the conformed dimension content
 Fact provider – owner and publisher of fact table,
attached to conformed dimensions
 New Hadoop personas
 “Robot” dimension manager – using auto schema
inference, pattern matching, similarity matching, …
What’s Easy and What’s
Challenging in Hadoop as of May
2014
 Easy
 Assembling/investigating radically diverse data
sources
 Scaling out to any size at any velocity
 Somewhat challenging
 Building extract logic for each diverse data source
 Updating and appending to existing HDFS files
(requires rewrite – straightforward but slow)
 Generating surrogate keys in a profoundly
distributed environment
 Stay tuned!
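To illustrate why surrogate key generation is harder without a single shared counter, here is one common workaround sketched in Python: give each worker a disjoint arithmetic sequence of keys. This is an assumption chosen for illustration, not a technique the deck endorses, and `key_stream` is a hypothetical helper. It needs no coordination at run time, at the cost of fixing the worker count up front and producing keys that are unique but not dense.

```python
from itertools import count, islice
from typing import Iterator


def key_stream(worker_id: int, num_workers: int, start: int = 1) -> Iterator[int]:
    """Yield start + worker_id, then step by num_workers, so streams never collide."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return count(start + worker_id, num_workers)


# Three hypothetical workers each draw four keys independently.
streams = [key_stream(w, 3) for w in range(3)]
batches = [list(islice(s, 4)) for s in streams]
all_keys = [k for batch in batches for k in batch]
```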
What Have We Accomplished
 Identified essential best practices from the EDW
world
 Business driven
 Dimensional approach
 Handling time variance with SCDs and surrogate
keys
 Integrating arbitrary sources with conformed
dimensions
 Shown examples of how to implement each best
practice in Hadoop
 Provided realistic assessment of current state of
Hadoop
The Kimball Group Resource
 www.kimballgroup.com
 Best selling data warehouse books
NEW BOOK! The Classic “Toolkit” 3rd Ed.
 In depth data warehouse classes
taught by primary authors
 Dimensional modeling (Ralph/Margy)
 ETL architecture (Ralph/Bob)
 Dimensional design reviews and consulting
by Kimball Group principals
 White Papers
on Integration, Data Quality, and Big Data Analytics
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechProduct School
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveIES VE
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)IES VE
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameKapil Thakar
 
How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxKaustubhBhavsar6
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxNeo4j
 

Recently uploaded (20)

AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 update
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First Frame
 
How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptx
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

  • 1. BEST PRACTICES FOR THE APACHE HADOOP DATA WAREHOUSE: EDW 101 FOR HADOOP PROFESSIONALS. RALPH KIMBALL / ELI COLLINS, MAY 2014. © Ralph Kimball, Cloudera, 2014
  • 2. The Enterprise Data Warehouse Legacy
      • More than 30 years, countless successful installations, billions of dollars
      • Fundamental architecture best practices
          • Business user driven: simple, fast, relevant
          • Best designs driven by actual data, not top-down models
          • Enterprise entities: dimensions, facts, and primary keys
          • Time variance: slowly changing dimensions
          • Integration: conformed dimensions
      • These best practices also apply to Hadoop systems
  • 3. Expose the Data as Dimensions and Facts
      • Dimensions are the enterprise’s fundamental entities
      • Dimensions are a strategic asset separate from any given data source
      • Dimensions need to be attached to each source
      • Measurement EVENTS are 1-to-1 with Fact Table RECORDS
      • The GRAIN of a fact table is the physical world’s description of the measurement event
  • 4. A Health Care Use Case
      • Health Care Hospital Events
      • Grain = Patient Event During Hospital Stay
  • 5. Importing Raw Data into Hadoop
      • Ingesting and transforming raw data from diverse sources for analysis is where Hadoop shines
          • What: medical device data, doctors’ notes, nurses’ notes, medications administered, procedures performed, diagnoses, lab tests, X-rays, ultrasound exams, therapists’ reports, billing, ...
          • From: operational RDBMSs, enterprise data warehouse, human-entered logs, machine-generated data files, special systems, ...
      • Use native ingest tools and third-party data integration products
      • Always retain original data in full fidelity
          • Keep data files “as is” or use Hadoop native formats
      • Opportunistically add data sources → Agile!
  • 6. Importing Raw Data into Hadoop
      • First step: get hospital procedures from billing RDBMS, doctors’ notes from RDBMS, patient info from DW, ...
      • As well as X-rays from radiology system
        $ sqoop import --connect jdbc:oracle:thin:@db.server.com/BILLING --table PROCEDURES --target-dir /ingest/procedures/2014_05_29
        $ hadoop fs -put /dcom_files/2014_05_29 hdfs://server.com/ingest/xrays/2014_05_29
        $ sqoop import … /EMR … --table CLINICAL_NOTES
        $ sqoop import … /CDR … --table PATIENT_INFO
  • 7. Plan the Fact Table
      • Second step: explore raw data immediately before committing to physical data transformations
        > CREATE EXTERNAL TABLE procedures_raw( date_key bigint, event timestamp, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/demo/procedures';
      • Third step: create queries on raw data that will be the basis for extracts from each source at the correct grain
  • 8. Building the Fact Table
      • Fourth step: build up a “native” table for facts using special logic from the extract queries created in step 3:
        > CREATE TABLE hospital_events(…) PARTITIONED BY (date_key bigint) STORED AS PARQUET;
        > INSERT INTO TABLE hospital_events SELECT <special logic> FROM procedures_raw;
        … SELECT <special logic> FROM patient_monitor_raw;
        … SELECT <special logic> FROM clinical_notes_raw;
        … SELECT <special logic> FROM device_17_raw;
        … SELECT <special logic> FROM radiology_reports_raw;
        … SELECT <special logic> FROM meds_adminstered_raw;
        … and more
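As a rough illustration of the fourth step, the sketch below uses plain Python in place of the HiveQL `INSERT ... SELECT` statements. The row layouts, field names, and the two `extract_*` functions are hypothetical stand-ins for the per-source `<special logic>`, which the slides do not spell out. The only point it tries to show is that every source is reduced to the same grain (one row per patient event) before being appended to a single fact table.

```python
from datetime import datetime

# Hypothetical raw rows; each real source would have its own layout.
procedures_raw = [
    {"patient": "MRN-0042", "performed_at": "2014-05-29 08:15:00", "code": "XR-CHEST"},
]
patient_monitor_raw = [
    {"pid": "MRN-0042", "ts": "2014-05-29 08:20:00", "metric": "heart_rate", "value": 72},
]


def date_key(ts):
    """Turn 'YYYY-MM-DD HH:MM:SS' into an integer partition key such as 20140529."""
    return int(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d"))


def extract_procedures(rows):
    """Hypothetical extract logic for the billing procedures source."""
    for r in rows:
        yield {
            "date_key": date_key(r["performed_at"]),
            "event": r["performed_at"],
            "patient_nk": r["patient"],      # natural key, not yet a surrogate key
            "event_type": "procedure",
            "detail": r["code"],
        }


def extract_monitor(rows):
    """Hypothetical extract logic for the patient monitor source."""
    for r in rows:
        yield {
            "date_key": date_key(r["ts"]),
            "event": r["ts"],
            "patient_nk": r["pid"],
            "event_type": "monitor_reading",
            "detail": "{}={}".format(r["metric"], r["value"]),
        }


# One fact table, many sources: every extract emits rows at the same grain.
hospital_events = []
for extract, source in ((extract_procedures, procedures_raw),
                        (extract_monitor, patient_monitor_raw)):
    hospital_events.extend(extract(source))
```

Each additional source would only need its own extract function that emits the same columns; the fact table itself does not change.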
  • 9. The Patient Dimension
      • Primary key is a “surrogate key”
      • Durable identifier is original “natural key”
      • 50 attributes typical
      • Dimension is instrumented for episodic (slow) changes
  • 10. Manage Your Primary Keys
      • “Natural” keys from source (often “un-natural”!)
          • Poorly administered, overwritten, duplicated
          • Awkward formats, implied semantic content
          • Profoundly incompatible across data sources
      • Replace or remap natural keys
          • Enterprise dimension keys are surrogate keys
          • Replace or remap in all dimension and fact tables
      • Attach high value enterprise dimensions to every source just by replacing the original natural keys
  • 11. Inserting Surrogate Keys in Facts
      • Re-write fact tables with dimension SKs
      • [Diagram: original facts carrying natural keys (NK) are joined to NK-to-SK mapping tables and inserted into the target fact table carrying surrogate keys (SK); deltas are appended to the facts and mapping tables]
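The join against a mapping table can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the deck's implementation: the function name `remap_natural_keys`, the `patient_nk`/`patient_sk` column names, and the sample keys are all assumptions made for the example.

```python
from itertools import count


def remap_natural_keys(facts, mapping, next_sk):
    """Return copies of `facts` with 'patient_sk' in place of 'patient_nk'.

    Natural keys not yet in `mapping` receive a fresh surrogate key, which is
    appended to the mapping (the "delta" appended to the mapping table).
    """
    remapped = []
    for fact in facts:
        nk = fact["patient_nk"]
        if nk not in mapping:
            mapping[nk] = next(next_sk)
        row = {k: v for k, v in fact.items() if k != "patient_nk"}
        row["patient_sk"] = mapping[nk]
        remapped.append(row)
    return remapped


# Hypothetical NK -> SK mapping table and incoming facts.
mapping = {"MRN-0042": 1, "MRN-0107": 2}
next_sk = count(max(mapping.values()) + 1)

original_facts = [
    {"patient_nk": "MRN-0042", "event_type": "procedure"},
    {"patient_nk": "MRN-0999", "event_type": "lab_test"},   # unseen natural key
    {"patient_nk": "MRN-0042", "event_type": "medication"},
]

target_facts = remap_natural_keys(original_facts, mapping, next_sk)
```

A single in-memory counter sidesteps the hard part: the deck itself lists generating surrogate keys in a profoundly distributed environment as somewhat challenging, and this sketch does not attempt to solve that.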
  • 12. Track Time Variance
      • Dimensional entities change slowly and episodically
      • EDW has responsibility to correctly represent history
      • Must provide for multiple historically time stamped versions of all dimension members
      • SCDs: Slowly Changing Dimensions
          • SCD Type 1: Overwrite dimension member, lose history
          • SCD Type 2: Add new time stamped dimension member record, track history
  • 13. Options for Implementing SCD 2
      • Re-import the dimension table each time
      • Or, import and merge the delta
      • Or, re-build the table in Hadoop
      • Implement complex merges with an integrated ETL tool, or in SQL via Impala or Hive
        $ sqoop import --table patient_info --incremental lastmodified --check-column SCD2_EFFECTIVE_DATETIME --last-value "2014-05-29 01:01:01"
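To make the "import and merge the delta" option concrete, here is a minimal Python sketch of a Type 2 merge. Everything in it is an assumption for illustration: the `apply_scd2` function, the `effective_from`/`effective_to` columns, the `OPEN_END` marker for the current version, and the sample patient row. A production merge would run in an ETL tool or in SQL, as the slide says.

```python
from datetime import datetime

OPEN_END = datetime(9999, 12, 31)  # hypothetical marker for "still current"


def apply_scd2(dimension, delta, tracked, next_sk):
    """Merge `delta` rows into `dimension` in place (SCD Type 2).

    A changed member keeps its old row (closed off at the change time) and
    gains a new row with a new surrogate key. Returns the next unused key.
    """
    current = {row["patient_nk"]: row
               for row in dimension if row["effective_to"] == OPEN_END}
    for change in delta:
        nk = change["patient_nk"]
        old = current.get(nk)
        if old is not None and all(old[a] == change[a] for a in tracked):
            continue  # no tracked attribute changed; nothing to record
        if old is not None:
            old["effective_to"] = change["changed_at"]  # close the old version
        new_row = {
            "patient_sk": next_sk,
            "patient_nk": nk,
            "effective_from": change["changed_at"],
            "effective_to": OPEN_END,
        }
        new_row.update({a: change[a] for a in tracked})
        dimension.append(new_row)
        current[nk] = new_row
        next_sk += 1
    return next_sk


patient_dim = [{
    "patient_sk": 1, "patient_nk": "MRN-0042", "state": "CA",
    "effective_from": datetime(2013, 1, 1), "effective_to": OPEN_END,
}]
delta = [{"patient_nk": "MRN-0042", "state": "OR",
          "changed_at": datetime(2014, 5, 29, 1, 1, 1)}]

next_sk = apply_scd2(patient_dim, delta, tracked=("state",), next_sk=2)
```

A Type 1 change, by contrast, would simply overwrite `old["state"]` and keep a single row, which is why it loses history.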
  • 14. Integrate Data Sources at the BI Layer
      • If the dimensions of two sources are not “conformed” then the sources cannot be integrated
      • Two dimensions are conformed if they share attributes (fields) that have the same domains and same content
      • The integration payload:
  • 15. Conforming Dimensions in Hadoop
      • Goal: combine diverse data sets in a single analysis
      • Conform operational and analytical schemas via key dimensions (user, product, geo)
      • Build and use mapping tables (à la SK handling)
        > CREATE TABLE patient_tmp LIKE patient_dim;
        > ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
        > INSERT INTO TABLE patient_tmp (SELECT … );
        > DROP TABLE patient_dim;
        > ALTER TABLE patient_tmp RENAME TO patient_dim;
        Tedious!
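Once two sources share a conformed attribute, integrating them at the BI layer amounts to summarizing each source separately by that attribute and then merging the two answer sets on the shared row headers. The Python sketch below shows that idea under stated assumptions: the `summarize` and `drill_across` helpers, the `state` attribute, and all sample rows are hypothetical, and the two patient dimensions deliberately use unrelated surrogate keys to show that only the conformed attribute needs to agree.

```python
from collections import defaultdict


def summarize(facts, dimension, attribute, measure):
    """Sum `measure` per value of a dimension attribute (one source's answer set)."""
    by_sk = {row["sk"]: row[attribute] for row in dimension}
    totals = defaultdict(float)
    for fact in facts:
        totals[by_sk[fact["sk"]]] += fact[measure]
    return dict(totals)


def drill_across(left, right):
    """Merge two answer sets on their shared row headers; None marks a missing side."""
    return {key: (left.get(key), right.get(key))
            for key in sorted(set(left) | set(right))}


# Two sources with their own keys but a conformed 'state' attribute.
billing_patient_dim = [{"sk": 1, "state": "CA"}, {"sk": 2, "state": "OR"}]
lab_patient_dim = [{"sk": 10, "state": "CA"}, {"sk": 11, "state": "WA"}]

billing_facts = [{"sk": 1, "amount": 120.0}, {"sk": 1, "amount": 80.0},
                 {"sk": 2, "amount": 50.0}]
lab_facts = [{"sk": 10, "tests": 3}, {"sk": 11, "tests": 1}]

billing_by_state = summarize(billing_facts, billing_patient_dim, "state", "amount")
labs_by_state = summarize(lab_facts, lab_patient_dim, "state", "tests")
report = drill_across(billing_by_state, labs_by_state)
```

If the two `state` attributes had different domains (for example, full names in one source and two-letter codes in the other), the merge would silently produce unmatched rows, which is the failure the conformance requirement is meant to prevent.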
  • 16. Integrate Data Sources at the BI Layer
      • Traditional data warehouse personas
          • Dimension manager – responsible for defining and publishing the conformed dimension content
          • Fact provider – owner and publisher of fact table, attached to conformed dimensions
      • New Hadoop personas
          • “Robot” dimension manager – using auto schema inference, pattern matching, similarity matching, …
  • 17. What’s Easy and What’s Challenging in Hadoop as of May 2014
      • Easy
          • Assembling/investigating radically diverse data sources
          • Scaling out to any size at any velocity
      • Somewhat challenging
          • Building extract logic for each diverse data source
          • Updating and appending to existing HDFS files (requires rewrite – straightforward but slow)
          • Generating surrogate keys in a profoundly distributed environment
      • Stay tuned!
  • 18. What Have We Accomplished
      • Identified essential best practices from the EDW world
          • Business driven
          • Dimensional approach
          • Handling time variance with SCDs and surrogate keys
          • Integrating arbitrary sources with conformed dimensions
      • Shown examples of how to implement each best practice in Hadoop
      • Provided realistic assessment of current state of
  • 19. The Kimball Group Resource
      • www.kimballgroup.com
      • Best-selling data warehouse books. NEW BOOK! The Classic “Toolkit” 3rd Ed.
      • In-depth data warehouse classes taught by primary authors
          • Dimensional modeling (Ralph/Margy)
          • ETL architecture (Ralph/Bob)
      • Dimensional design reviews and consulting by Kimball Group principals
      • White Papers on Integration, Data Quality, and Big Data Analytics