Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals



The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices can actually be implemented in Hadoop.

Published in: Technology


1. BEST PRACTICES FOR THE APACHE HADOOP DATA WAREHOUSE
   EDW 101 FOR HADOOP PROFESSIONALS
   RALPH KIMBALL / ELI COLLINS
   MAY 2014
   © Ralph Kimball, Cloudera, 2014
2. The Enterprise Data Warehouse Legacy
   - More than 30 years, countless successful installations, billions of dollars
   - Fundamental architecture best practices
     - Business user driven: simple, fast, relevant
     - Best designs driven by actual data, not top-down models
     - Enterprise entities: dimensions, facts, and primary keys
     - Time variance: slowly changing dimensions
     - Integration: conformed dimensions
   - These best practices also apply to Hadoop systems
3. Expose the Data as Dimensions and Facts
   - Dimensions are the enterprise's fundamental entities
   - Dimensions are a strategic asset separate from any given data source
   - Dimensions need to be attached to each source
   - Measurement EVENTS are 1-to-1 with Fact Table RECORDS
   - The GRAIN of a fact table is the physical world's description of the measurement event
4. A Health Care Use Case
   (Figure: the "Health Care Hospital Events" fact table)
   - Grain = Patient Event During Hospital Stay
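A hedged illustration of declaring this grain as a schema: a minimal version of the hospital_events fact table, where the table name, the date_key partition, and the Parquet format come from slides 7-8, and the remaining column names are assumptions for the sketch.

    -- One row per patient event during a hospital stay (the declared grain)
    CREATE TABLE hospital_events (
        patient_sk   BIGINT,     -- surrogate key into the patient dimension
        provider_sk  BIGINT,     -- illustrative surrogate key into a provider dimension
        event_type   STRING,     -- e.g. procedure, medication, lab test
        event_time   TIMESTAMP,
        event_value  DOUBLE      -- the numeric measurement, where one exists
    )
    PARTITIONED BY (date_key BIGINT)
    STORED AS PARQUET;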
5. Importing Raw Data into Hadoop
   - Ingesting and transforming raw data from diverse sources for analysis is where Hadoop shines
   - What: Medical device data, doctors' notes, nurse's notes, medications administered, procedures performed, diagnoses, lab tests, X-rays, ultrasound exams, therapists' reports, billing, ...
   - From: Operational RDBMSs, enterprise data warehouse, human-entered logs, machine-generated data files, special systems, ...
   - Use native ingest tools & 3rd-party data integration products
   - Always retain original data in full fidelity
   - Keep data files "as is" or use Hadoop native formats
   - Opportunistically add data sources
   - Agile!
6. Importing Raw Data into Hadoop
   - First step: get hospital procedures from the billing RDBMS, doctors' notes from the RDBMS, patient info from the DW, ...
   - As well as X-rays from the radiology system

   $ sqoop import --connect jdbc:oracle:thin:@db.server.com/BILLING \
       --table PROCEDURES --target-dir /ingest/procedures/2014_05_29
   $ hadoop fs -put /dcom_files/2014_05_29 hdfs://server.com/ingest/xrays/2014_05_29
   $ sqoop import … /EMR … --table CLINICAL_NOTES
   $ sqoop import … /CDR … --table PATIENT_INFO
7. Plan the Fact Table
   - Second step: explore raw data immediately, before committing to physical data transformations
   - Third step: create queries on raw data that will be the basis for extracts from each source at the correct grain

   > CREATE EXTERNAL TABLE procedures_raw (
         date_key bigint,
         event timestamp,
         …)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
     LOCATION '/demo/procedures';
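A hedged example of that "second step" exploration against the external table just defined, using only the columns shown on the slide: simple profiling queries to sanity-check the extract before writing transformations.

    -- Row count and event-time range of the raw extract
    SELECT COUNT(*)  AS row_count,
           MIN(event) AS earliest_event,
           MAX(event) AS latest_event
    FROM procedures_raw;

    -- Eyeball a small sample of the raw records
    SELECT * FROM procedures_raw LIMIT 20;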
8. Building the Fact Table
   - Fourth step: build up a "native" table for facts using special logic from the extract queries created in step 3:

   > CREATE TABLE hospital_events (…)
     PARTITIONED BY (date_key bigint)
     STORED AS PARQUET;
   > INSERT INTO TABLE hospital_events SELECT <special logic> FROM procedures_raw;
   … SELECT <special logic> FROM patient_monitor_raw;
   … SELECT <special logic> FROM clinical_notes_raw;
   … SELECT <special logic> FROM device_17_raw;
   … SELECT <special logic> FROM radiology_reports_raw;
   … SELECT <special logic> FROM meds_adminstered_raw;
   … and more
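One hedged practical note on this pattern: because the table is partitioned by date_key, Hive expects an explicit PARTITION clause and dynamic partitioning to be enabled, roughly as sketched below. The column names follow the earlier sketch, and procedures_conformed is a hypothetical staging view standing in for the slide's elided "special logic".

    -- Hive session settings for dynamic partition inserts
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- The partition column is named in the PARTITION clause and comes last in the SELECT list
    INSERT INTO TABLE hospital_events PARTITION (date_key)
    SELECT patient_sk, provider_sk, event_type, event_time, event_value, date_key
    FROM procedures_conformed;   -- hypothetical view holding the output of the "special logic"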
9. The Patient Dimension
   - Primary key is a "surrogate key"
   - Durable identifier is the original "natural key"
   - 50 attributes typical
   - Dimension is instrumented for episodic (slow) changes
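A hedged sketch of what this dimension might look like as a Hive/Impala table. The patient_dim name and the SCD2_EFFECTIVE_DATETIME column appear on later slides; the remaining column names are assumptions standing in for the roughly 50 real attributes.

    -- Hypothetical patient dimension, instrumented for SCD Type 2 history
    CREATE TABLE patient_dim (
        patient_sk               BIGINT,     -- surrogate primary key, one per version of a member
        patient_durable_nk       STRING,     -- durable natural key from the source system
        patient_name             STRING,     -- ...plus roughly 50 descriptive attributes in practice
        date_of_birth            STRING,
        primary_physician        STRING,
        scd2_effective_datetime  TIMESTAMP,  -- when this version became current
        scd2_expiration_datetime TIMESTAMP,  -- when it was superseded (NULL for the current version)
        current_flag             BOOLEAN
    )
    STORED AS PARQUET;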
10. Manage Your Primary Keys
   - "Natural" keys from the source (often "un-natural"!)
     - Poorly administered, overwritten, duplicated
     - Awkward formats, implied semantic content
     - Profoundly incompatible across data sources
   - Replace or remap natural keys
     - Enterprise dimension keys are surrogate keys
     - Replace or remap in all dimension and fact tables
   - Attach high-value enterprise dimensions to every source just by replacing the original natural keys
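One hedged way to mint surrogate keys and maintain an NK-to-SK mapping table in SQL; this is a sketch under assumed names (patient_key_map, patient_info_raw), not the deck's own method, and slide 17 flags distributed key generation as one of the harder problems.

    -- Hypothetical NK -> SK mapping table for the patient dimension
    CREATE TABLE patient_key_map (
        patient_nk STRING,   -- natural key as it appears in the source
        patient_sk BIGINT    -- enterprise surrogate key
    ) STORED AS PARQUET;

    -- Stage surrogate keys for natural keys not yet mapped,
    -- numbering them above the current maximum key
    CREATE TABLE patient_key_map_delta STORED AS PARQUET AS
    SELECT new_nk.patient_nk,
           cur.max_sk + ROW_NUMBER() OVER (ORDER BY new_nk.patient_nk) AS patient_sk
    FROM (SELECT DISTINCT r.patient_nk
          FROM patient_info_raw r
          LEFT JOIN patient_key_map m ON r.patient_nk = m.patient_nk
          WHERE m.patient_sk IS NULL) new_nk
    CROSS JOIN (SELECT COALESCE(MAX(patient_sk), 0) AS max_sk
                FROM patient_key_map) cur;

    -- Append the new mappings
    INSERT INTO TABLE patient_key_map SELECT * FROM patient_key_map_delta;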
11. Inserting Surrogate Keys in Facts
   - Re-write fact tables with dimension SKs
   (Figure: original fact tables carrying NKs are joined to SK/NK mapping tables and inserted into the target fact table carrying SKs; deltas are appended to both the facts and the mapping tables)
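Since the slide's diagram is flattened in the transcript, here is a hedged SQL rendering of the same idea, reusing the mapping-table names from the earlier sketches: join the incoming facts, which still carry natural keys, to the mapping tables and write surrogate keys into the target fact table. hospital_events_nk_stage and provider_key_map are hypothetical names.

    -- Swap natural keys for surrogate keys by joining through the mapping tables
    -- (dynamic partition settings as in the note after slide 8)
    INSERT INTO TABLE hospital_events PARTITION (date_key)
    SELECT pmap.patient_sk,
           dmap.provider_sk,
           f.event_type,
           f.event_time,
           f.event_value,
           f.date_key
    FROM hospital_events_nk_stage f      -- hypothetical staging facts keyed by NKs
    JOIN patient_key_map  pmap ON f.patient_nk  = pmap.patient_nk
    JOIN provider_key_map dmap ON f.provider_nk = dmap.provider_nk;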
12. Track Time Variance
   - Dimensional entities change slowly and episodically
   - EDW has responsibility to correctly represent history
   - Must provide for multiple historically time-stamped versions of all dimension members
   - SCDs: Slowly Changing Dimensions
     - SCD Type 1: overwrite the dimension member, lose history
     - SCD Type 2: add a new time-stamped dimension member record, track history
13. Options for Implementing SCD 2
   - Re-import the dimension table each time
   - Or, import and merge the delta
   - Or, re-build the table in Hadoop
   - Implement complex merges with an integrated ETL tool, or in SQL via Impala or Hive

   $ sqoop import --table patient_info --incremental lastmodified \
       --check-column SCD2_EFFECTIVE_DATETIME --last-value "2014-05-29 01:01:01"
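The "re-build the table in Hadoop" option can be written as plain SQL. A hedged sketch, assuming a patient_delta staging table loaded by the sqoop command above, the SCD 2 columns from the earlier patient dimension sketch, at most one delta row per member per load, and new_patient_sk standing for surrogate keys minted as on slide 10; attributes beyond patient_name are omitted for brevity.

    -- Full rebuild variant of the SCD 2 merge (Hive had no in-place UPDATE/MERGE in 2014)
    CREATE TABLE patient_dim_rebuilt STORED AS PARQUET AS
    SELECT * FROM (
        -- existing rows: close out the current version of any member present in the delta
        SELECT d.patient_sk,
               d.patient_durable_nk,
               d.patient_name,
               d.scd2_effective_datetime,
               CASE WHEN ch.patient_durable_nk IS NOT NULL AND d.current_flag
                    THEN ch.scd2_effective_datetime
                    ELSE d.scd2_expiration_datetime END AS scd2_expiration_datetime,
               CASE WHEN ch.patient_durable_nk IS NOT NULL THEN FALSE
                    ELSE d.current_flag END AS current_flag
        FROM patient_dim d
        LEFT JOIN patient_delta ch
          ON d.patient_durable_nk = ch.patient_durable_nk
        UNION ALL
        -- delta rows: append as the new current version
        SELECT ch.new_patient_sk,
               ch.patient_durable_nk,
               ch.patient_name,
               ch.scd2_effective_datetime,
               CAST(NULL AS TIMESTAMP) AS scd2_expiration_datetime,
               TRUE AS current_flag
        FROM patient_delta ch
    ) merged;
    -- then swap the rebuilt table in for patient_dim, as on slide 15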
14. Integrate Data Sources at the BI Layer
   - If the dimensions of two sources are not "conformed", then the sources cannot be integrated
   - Two dimensions are conformed if they share attributes (fields) that have the same domains and the same content
   - The integration payload: (figure)
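A hedged sketch of that payload as a drill-across query, assuming a second fact table (clinic_visits) that shares the conformed patient dimension: each fact table is summarized separately on a conformed attribute, and the summaries are then joined on that attribute rather than joining the fact tables to each other directly.

    -- Drill-across on the conformed attribute primary_physician
    SELECT h.primary_physician,
           h.hospital_event_count,
           c.clinic_visit_count
    FROM (SELECT p.primary_physician, COUNT(*) AS hospital_event_count
          FROM hospital_events e JOIN patient_dim p ON e.patient_sk = p.patient_sk
          GROUP BY p.primary_physician) h
    JOIN (SELECT p.primary_physician, COUNT(*) AS clinic_visit_count
          FROM clinic_visits v JOIN patient_dim p ON v.patient_sk = p.patient_sk
          GROUP BY p.primary_physician) c
      ON h.primary_physician = c.primary_physician;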
15. Conforming Dimensions in Hadoop
   - Goal: combine diverse data sets in a single analysis
   - Conform operational and analytical schemas via key dimensions (user, product, geo)
   - Build and use mapping tables (a la SK handling)

   > CREATE TABLE patient_tmp LIKE patient_dim;
   > ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
   > INSERT INTO TABLE patient_tmp (SELECT … );
   > DROP TABLE patient_dim;
   > ALTER TABLE patient_tmp RENAME TO patient_dim;
   Tedious!
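A hedged sketch of the mapping-table side of this, complementing the slide's code: a small table that maps each source's local state codes onto the conformed state_conf domain. The state_code_map name, the 'CDR' source tag, and the p.state_code column are assumptions; the elided SELECT on the slide could then look roughly like the second statement.

    -- Hypothetical mapping table that conforms source-specific state codes
    CREATE TABLE state_code_map (
        source_system     STRING,
        source_state_code STRING,
        state_conf        INT      -- conformed enterprise code for the state
    ) STORED AS PARQUET;

    -- Populate patient_tmp with the conformed attribute attached
    INSERT INTO TABLE patient_tmp
    SELECT p.*, m.state_conf
    FROM patient_dim p
    LEFT JOIN state_code_map m
      ON m.source_system = 'CDR' AND m.source_state_code = p.state_code;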
16. Integrate Data Sources at the BI Layer
   - Traditional data warehouse personas
     - Dimension manager – responsible for defining and publishing the conformed dimension content
     - Fact provider – owner and publisher of a fact table, attached to conformed dimensions
   - New Hadoop personas
     - "Robot" dimension manager – using auto schema inference, pattern matching, similarity matching, …
17. What's Easy and What's Challenging in Hadoop as of May 2014
   - Easy
     - Assembling/investigating radically diverse data sources
     - Scaling out to any size at any velocity
   - Somewhat challenging
     - Building extract logic for each diverse data source
     - Updating and appending to existing HDFS files (requires rewrite – straightforward but slow)
     - Generating surrogate keys in a profoundly distributed environment
   - Stay tuned!
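To make the "requires rewrite" point concrete, a hedged sketch of the usual pattern: rather than updating rows in place, the whole affected partition is rewritten. hospital_events_corrected is a hypothetical staging table holding the corrected rows; the other names follow the earlier sketches.

    -- Re-write a single date partition in full instead of updating rows in place
    INSERT OVERWRITE TABLE hospital_events PARTITION (date_key = 20140529)
    SELECT patient_sk, provider_sk, event_type, event_time, event_value
    FROM hospital_events_corrected
    WHERE date_key = 20140529;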
18. What Have We Accomplished
   - Identified essential best practices from the EDW world
     - Business driven
     - Dimensional approach
     - Handling time variance with SCDs and surrogate keys
     - Integrating arbitrary sources with conformed dimensions
   - Shown examples of how to implement each best practice in Hadoop
   - Provided a realistic assessment of the current state of Hadoop
19. The Kimball Group Resource
   - www.kimballgroup.com
   - Best-selling data warehouse books
     - NEW BOOK! The classic "Toolkit", 3rd Ed.
   - In-depth data warehouse classes taught by the primary authors
     - Dimensional modeling (Ralph/Margy)
     - ETL architecture (Ralph/Bob)
   - Dimensional design reviews and consulting by Kimball Group principals
   - White papers on Integration, Data Quality, and Big Data Analytics
