
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals



The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices, including identifying dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices can actually be implemented in Hadoop.

Published in: Technology

  1. Best Practices for the Apache Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
     Ralph Kimball / Eli Collins, May 2014
     © Ralph Kimball, Cloudera, 2014
  2. The Enterprise Data Warehouse Legacy
     - More than 30 years, countless successful installations, billions of dollars
     - Fundamental architecture best practices
       - Business user driven: simple, fast, relevant
       - Best designs driven by actual data, not top-down models
       - Enterprise entities: dimensions, facts, and primary keys
       - Time variance: slowly changing dimensions
       - Integration: conformed dimensions
     - These best practices also apply to Hadoop systems
  3. Expose the Data as Dimensions and Facts
     - Dimensions are the enterprise's fundamental entities
     - Dimensions are a strategic asset separate from any given data source
     - Dimensions need to be attached to each source
     - Measurement EVENTS are 1-to-1 with Fact Table RECORDS
     - The GRAIN of a fact table is the physical world's description of the measurement event
  4. A Health Care Use Case
     - Grain = Health Care Hospital Events
     - Grain = Patient Event During Hospital Stay
  5. Importing Raw Data into Hadoop
     - Ingesting and transforming raw data from diverse sources for analysis is where Hadoop shines
       - What: medical device data, doctors' notes, nurses' notes, medications administered, procedures performed, diagnoses, lab tests, X-rays, ultrasound exams, therapists' reports, billing, ...
       - From: operational RDBMSs, enterprise data warehouse, human-entered logs, machine-generated data files, special systems, ...
     - Use native ingest tools and third-party data integration products
     - Always retain original data in full fidelity
       - Keep data files "as is" or use Hadoop native formats
     - Opportunistically add data sources
       - Agile!
  6. Importing Raw Data into Hadoop
     - First step: get hospital procedures from the billing RDBMS, doctors' notes from an RDBMS, patient info from the DW, ...
     - As well as X-rays from the radiology system

       $ sqoop import --connect --table PROCEDURES --target-dir /ingest/procedures/2014_05_29
       $ hadoop fs -put /dcom_files/2014_05_29 hdfs://
       $ sqoop import … /EMR … --table CLINICAL_NOTES
       $ sqoop import … /CDR … --table PATIENT_INFO
  7. Plan the Fact Table
     - Second step: explore raw data immediately, before committing to physical data transformations
     - Third step: create queries on raw data that will be the basis for extracts from each source at the correct grain

       > CREATE EXTERNAL TABLE procedures_raw(
           date_key bigint,
           event timestamp, …)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         LOCATION '/demo/procedures';
  8. Building the Fact Table
     - Fourth step: build up a "native" table for facts using special logic from the extract queries created in step 3:

       > CREATE TABLE hospital_events(…)
         PARTITIONED BY (date_key bigint)
         STORED AS PARQUET;
       > INSERT INTO TABLE hospital_events
         SELECT <special logic> FROM procedures_raw;
         … SELECT <special logic> FROM patient_monitor_raw;
         … SELECT <special logic> FROM clinical_notes_raw;
         … SELECT <special logic> FROM device_17_raw;
         … SELECT <special logic> FROM radiology_reports_raw;
         … SELECT <special logic> FROM meds_adminstered_raw;
         … and more
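The fourth step can be illustrated outside Hadoop with a small Python sketch. It is not the presenters' code: the field names (`patient_id`, `mrn`, `procedure_code`, `drug`), the two extract functions, and the sample rows are all hypothetical. The idea it shows is that each source gets its own "special logic" that emits rows of one common shape at the fact table's grain, and the combined rows are grouped by `date_key`, mirroring the partitioned `hospital_events` table.

```python
from collections import defaultdict

# All field names and sample values below are hypothetical illustrations.

def from_procedures(raw_rows):
    """Special logic for the billing source: one fact row per procedure."""
    for r in raw_rows:
        yield {
            "date_key": r["date_key"],
            "patient_nk": r["patient_id"],
            "event_type": "procedure",
            "detail": r["procedure_code"],
        }

def from_meds(raw_rows):
    """Special logic for the medications source: one fact row per dose."""
    for r in raw_rows:
        yield {
            "date_key": r["date_key"],
            "patient_nk": r["mrn"],
            "event_type": "medication",
            "detail": r["drug"],
        }

def build_hospital_events(*extracts):
    """Union all extracts and group the rows by date_key (the partition key)."""
    partitions = defaultdict(list)
    for extract in extracts:
        for row in extract:
            partitions[row["date_key"]].append(row)
    return dict(partitions)

procedures_raw = [
    {"date_key": 20140528, "patient_id": "P-17", "procedure_code": "XR-CHEST"},
    {"date_key": 20140529, "patient_id": "P-17", "procedure_code": "US-ABD"},
]
meds_raw = [{"date_key": 20140529, "mrn": "P-17", "drug": "ibuprofen"}]

hospital_events = build_hospital_events(
    from_procedures(procedures_raw), from_meds(meds_raw)
)
```

Because every extract emits the same columns, adding another source means writing one more extract function; the fact table itself does not change.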
  9. The Patient Dimension
     - Primary key is a "surrogate key"
     - Durable identifier is the original "natural key"
     - 50 attributes typical
     - Dimension is instrumented for episodic (slow) changes
  10. Manage Your Primary Keys
      - "Natural" keys from source (often "un-natural"!)
        - Poorly administered, overwritten, duplicated
        - Awkward formats, implied semantic content
        - Profoundly incompatible across data sources
      - Replace or remap natural keys
        - Enterprise dimension keys are surrogate keys
        - Replace or remap in all dimension and fact tables
        - Attach high-value enterprise dimensions to every source just by replacing the original natural keys
  11. Inserting Surrogate Keys in Facts
      - Re-write fact tables with dimension SKs
      - [Diagram: original facts carrying natural keys (NK) are joined to NK-to-SK mapping tables and inserted into the target fact table carrying surrogate keys (SK); deltas are appended to the facts and mapping tables]
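The join-and-insert flow on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the presenters' implementation: the function name `remap_facts`, the column names `patient_nk` and `patient_sk`, and the sample keys are hypothetical. Each fact row's natural key is looked up in a mapping table; unseen natural keys are appended to the mapping with a fresh surrogate key, which corresponds to the "append deltas" step.

```python
# Hypothetical names and data; a sketch of NK -> SK replacement in fact rows.

def remap_facts(facts, mapping, next_sk):
    """Return fact rows with the natural key replaced by a surrogate key.

    `mapping` (natural key -> surrogate key) is extended in place when a
    natural key has not been seen before. Returns the rewritten rows and
    the next unused surrogate key.
    """
    target = []
    for row in facts:
        nk = row["patient_nk"]
        if nk not in mapping:
            mapping[nk] = next_sk  # append a delta to the mapping table
            next_sk += 1
        new_row = {k: v for k, v in row.items() if k != "patient_nk"}
        new_row["patient_sk"] = mapping[nk]
        target.append(new_row)
    return target, next_sk

mapping = {"MRN-0042": 1, "MRN-0077": 2}
original_facts = [
    {"patient_nk": "MRN-0042", "event": "x-ray"},
    {"patient_nk": "MRN-0099", "event": "lab test"},
]
target_facts, next_sk = remap_facts(original_facts, mapping, next_sk=3)
```

In Hadoop the same logic would be a join between the raw facts and the mapping table that writes a new target table, since existing files are rewritten rather than updated in place.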
  12. Track Time Variance
      - Dimensional entities change slowly and episodically
      - EDW has responsibility to correctly represent history
      - Must provide for multiple historically time-stamped versions of all dimension members
      - SCDs: Slowly Changing Dimensions
        - SCD Type 1: overwrite dimension member, lose history
        - SCD Type 2: add new time-stamped dimension member record, track history
  13. Options for Implementing SCD 2
      - Re-import the dimension table each time
      - Or, import and merge the delta
      - Or, re-build the table in Hadoop
      - Implement complex merges with an integrated ETL tool, or in SQL via Impala or Hive

        $ sqoop import --table patient_info --incremental lastmodified --check-column SCD2_EFFECTIVE_DATETIME --last-value "2014-05-29 01:01:01"
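The "import and merge the delta" option can be sketched as a small Python function that applies SCD Type 2 logic. This is an illustration only, assuming a hypothetical row layout (`patient_sk`, `patient_nk`, `begin`, `end`) and a far-future `HIGH_DATE` that marks the current version of each member; none of these names come from the slides. When a tracked attribute changes, the current row is expired and a new row with a new surrogate key is added, so history is kept.

```python
from datetime import datetime

# Hypothetical convention: the current version of a member ends at HIGH_DATE.
HIGH_DATE = datetime(9999, 12, 31)

def apply_scd2(dimension, delta, effective, tracked):
    """Merge changed source rows into a dimension using SCD Type 2.

    dimension: list of dicts with patient_sk, patient_nk, begin, end and the
               tracked attributes; modified in place and returned.
    delta:     source rows keyed by patient_nk with the tracked attributes.
    """
    next_sk = max((r["patient_sk"] for r in dimension), default=0) + 1
    current = {r["patient_nk"]: r for r in dimension if r["end"] == HIGH_DATE}
    for change in delta:
        nk = change["patient_nk"]
        old = current.get(nk)
        if old is not None and all(old[a] == change[a] for a in tracked):
            continue  # no tracked attribute changed: nothing to do
        if old is not None:
            old["end"] = effective  # expire the previous version
        new_row = {
            "patient_sk": next_sk,
            "patient_nk": nk,
            "begin": effective,
            "end": HIGH_DATE,
        }
        new_row.update({a: change[a] for a in tracked})
        dimension.append(new_row)
        current[nk] = new_row
        next_sk += 1
    return dimension

patient_dim = [
    {"patient_sk": 1, "patient_nk": "MRN-0042", "state": "CA",
     "begin": datetime(2014, 1, 1), "end": HIGH_DATE},
]
delta = [
    {"patient_nk": "MRN-0042", "state": "OR"},  # changed member
    {"patient_nk": "MRN-0077", "state": "WA"},  # new member
]
apply_scd2(patient_dim, delta, datetime(2014, 5, 29), tracked=["state"])
```

An SCD Type 1 change would instead overwrite `state` on the existing row and keep the same surrogate key, losing the earlier value.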
  14. Integrate Data Sources at the BI Layer
      - If the dimensions of two sources are not "conformed" then the sources cannot be integrated
      - Two dimensions are conformed if they share attributes (fields) that have the same domains and same content
      - The integration payload:
  15. Conforming Dimensions in Hadoop
      - Goal: combine diverse data sets in a single analysis
      - Conform operational and analytical schemas via key dimensions (user, product, geo)
      - Build and use mapping tables (à la SK handling)

        > CREATE TABLE patient_tmp LIKE patient_dim;
        > ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
        > INSERT INTO TABLE patient_tmp (SELECT … );
        > DROP TABLE patient_dim;
        > ALTER TABLE patient_tmp RENAME TO patient_dim;

        tedious!
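Once two sources share a conformed attribute, they can be combined at the BI layer by summarizing each source separately on that attribute and then merging the results, a pattern usually called drill-across. The sketch below assumes hypothetical billing and lab sources that each carry a `state_conf` attribute with the same domain and content; the function names and sample data are illustrative, not taken from the webinar.

```python
from collections import defaultdict

def summarize(facts, dimension, key, attr, measure):
    """Total `measure` per value of the conformed attribute `attr`."""
    lookup = {d[key]: d[attr] for d in dimension}
    totals = defaultdict(float)
    for f in facts:
        totals[lookup[f[key]]] += f[measure]
    return dict(totals)

def drill_across(left, right):
    """Merge two summaries on their shared (conformed) row labels."""
    labels = sorted(set(left) | set(right))
    return {label: (left.get(label), right.get(label)) for label in labels}

# Hypothetical source 1: billing, keyed by a surrogate key.
billing_dim = [
    {"patient_sk": 1, "state_conf": "CA"},
    {"patient_sk": 2, "state_conf": "OR"},
]
billing_facts = [
    {"patient_sk": 1, "amount": 100.0},
    {"patient_sk": 2, "amount": 50.0},
    {"patient_sk": 1, "amount": 25.0},
]

# Hypothetical source 2: lab system with its own key, same conformed attribute.
lab_dim = [
    {"subject_id": "a", "state_conf": "CA"},
    {"subject_id": "b", "state_conf": "WA"},
]
lab_facts = [
    {"subject_id": "a", "tests": 2},
    {"subject_id": "b", "tests": 1},
]

billed = summarize(billing_facts, billing_dim, "patient_sk", "state_conf", "amount")
tested = summarize(lab_facts, lab_dim, "subject_id", "state_conf", "tests")
report = drill_across(billed, tested)
```

The two sources never need a common key; they only need `state_conf` to mean the same thing in both, which is what conforming the dimension guarantees.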
  16. Integrate Data Sources at the BI Layer
      - Traditional data warehouse personas
        - Dimension manager: responsible for defining and publishing the conformed dimension content
        - Fact provider: owner and publisher of fact table, attached to conformed dimensions
      - New Hadoop personas
        - "Robot" dimension manager: using auto schema inference, pattern matching, similarity matching, …
  17. What's Easy and What's Challenging in Hadoop as of May 2014
      - Easy
        - Assembling/investigating radically diverse data sources
        - Scaling out to any size at any velocity
      - Somewhat challenging
        - Building extract logic for each diverse data source
        - Updating and appending to existing HDFS files (requires rewrite; straightforward but slow)
        - Generating surrogate keys in a profoundly distributed environment
      - Stay tuned!
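The slide leaves distributed surrogate key generation open. One common approach, shown here purely as an assumption and not as anything the presenters proposed, is to give each worker a disjoint key range by packing a worker id into the high bits of the key and a local counter into the low bits, so that no coordination is needed between workers. The names `make_key_generator` and `COUNTER_BITS` are hypothetical.

```python
# Assumed scheme (not from the slides): key = worker_id in the high bits,
# per-worker counter in the low bits, so workers never collide.
COUNTER_BITS = 40

def make_key_generator(worker_id, counter_bits=COUNTER_BITS):
    """Return a function that yields unique keys for one worker."""
    limit = 1 << counter_bits
    counter = 0

    def next_key():
        nonlocal counter
        if counter >= limit:
            raise OverflowError("worker exhausted its key range")
        key = (worker_id << counter_bits) | counter
        counter += 1
        return key

    return next_key

worker0 = make_key_generator(0)
worker1 = make_key_generator(1)
keys = [worker0(), worker0(), worker1(), worker1()]
```

The trade-off is that keys are unique but not dense or globally ordered, which is normally acceptable because surrogate keys carry no meaning of their own.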
  18. What Have We Accomplished
      - Identified essential best practices from the EDW world
        - Business driven
        - Dimensional approach
        - Handling time variance with SCDs and surrogate keys
        - Integrating arbitrary sources with conformed dimensions
      - Shown examples of how to implement each best practice in Hadoop
      - Provided realistic assessment of current state of
  19. The Kimball Group Resource
      - Best-selling data warehouse books
        - NEW BOOK! The Classic "Toolkit" 3rd Ed.
      - In-depth data warehouse classes taught by primary authors
        - Dimensional modeling (Ralph/Margy)
        - ETL architecture (Ralph/Bob)
      - Dimensional design reviews and consulting by Kimball Group principals
      - White Papers on Integration, Data Quality, and Big Data Analytics