Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data Vault 2.0: Using MD5 Hashes for Change Data Capture


Published on

This presentation was given at OakTable World 2014 (#OTW14) in San Francisco as a short Ted-style 10 minute talk. In it I introduce Data Vault 2.0 and its innovative approach to doing change data capture in a data warehouse by using MD5 Hash columns.

Published in: Data & Analytics

Data Vault 2.0: Using MD5 Hashes for Change Data Capture

  1. 1. Data Vault 2.0: Using MD5 Hashes for Change Data Capture Kent Graziano Data Warrior LLC Twitter @KentGraziano
  2. 2. Data Vault Definition The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. Architected specifically to meet the needs of today’s enterprise data warehouses Dan Linstedt: Defining the Data Vault Article
  3. 3. Data Vault Time Line E.F. Codd invented relational modeling Chris Date and Hugh Darwen Maintained and Refined Modeling 1976 Dr Peter Chen Created E-R Diagramming Mid 70’s AC Nielsen Popularized Dimension & Fact Terms 1990 – Dan Linstedt Begins R&D on Data Vault Modeling 1960 1970 1980 1990 2000 Early 70’s Bill Inmon Began Discussing Data Warehousing Mid 60’s Dimension & Fact Modeling presented by General Mills and Dartmouth University Late 80’s – Barry Devlin and Dr Kimball Release “Business Data Warehouse” Mid 80’s Bill Inmon Popularizes Data Warehousing Mid – Late 80’s Dr Kimball Popularizes Star Schema 2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling ©
  4. 4. 2014 - Next Evolution
  5. 5. What’s New in DV2.0?  Modeling Structure Includes… ● NoSQL, and Non-Relational DB systems, Hybrid Systems ● Minor Structure Changes to support NoSQL  New ETL Implementation Standards ● For true real-time support ● For NoSQL support  New Architecture Standards ● To include support for NoSQL data management systems  New Methodology Components ● Including CMMI, Six Sigma, and TQM ● Including Project Planning, Tracking, and Oversight ● Agile Delivery Mechanisms ● Standards, and templates for Projects ©
  6. 6. This model is fully compliant with Hadoop, needs NO changes to work properly. The Hash Keys can be used to join to Hadoop data sets. MD5 PK – replaces surrogate keys MD5DIFF – used for change detection Use of MD5 Hash in DV2.0 ©
  7. 7. MD5-based Change Detection  Think Type 2 SCD  Old Way: ● Compare column by column ● Source value != Current value in DW table ● 20 columns, then 20 compares  New Way: ● Concatenate all columns to one string ● Convert to one char(32) string with hash function ● Compare to hashed value (MD5DIFF) in target table ● Does not matter how many columns © Data Warrior LLC
  8. 8. What does it look like?  Encode using standard MD5 hash function ● rawtohex(sys.utl_raw.cast_to_raw( dbms_obfuscation_toolkit.md5 (input_string => ...)  Need to minimize chance of duplicates ● 12||3||45 and 1||2||345 hash to same value ● Need a separator between each ● Also handles case of null values ● Example: Col1||’^’||Col2||’^’||Col3 © Data Warrior LLC
  9. 9. Other considerations  To generate most consistent string: standardize!  Convert data types  If 'NUMBER', 'NVARCHAR2', 'NVARCHAR', 'NCHAR‘ ● THEN 'TO_CHAR(' || column_name || ')‘  If 'RAW‘ ● THEN 'ENC_BASE64(' || column_name || ')‘  If 'DATE‘ ● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD'')‘  If LIKE 'TIME%‘ ● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD HH24:MI:SS'')' © Data Warrior LLC
  10. 10. Final Input String (UPPER(TRIM(T1.GENERICNAME)) ||'^'|| UPPER(TRIM( TO_CHAR(T1.MED_STRNG_AMT))) ||'^'|| UPPER(TRIM(T1.UOM_CD)) ||'^'|| UPPER(TRIM(T1.MED_FORM_NM)) ||'^') © Data Warrior LLC
  11. 11. So what?  MD5 hash is consistent cross-platform  Changes multi-column compares to a single column  All compares take the same time during load process  Can use with any DW architecture that requires change detections  Virtually no limit ● Think Big Data/Hadoop/NoSQL  Can generate the input string automatically ● But that is another talk! © Data Warrior LLC
  12. 12. Learn more about Data Vault On YouTube: On Facebook:
  13. 13. Super Charge Your Data Warehouse Available on Soft Cover or Kindle Format Now also available in PDF at
  14. 14. Contact Information Kent Graziano The Oracle Data Warrior Data Warrior LLC On Twitter @KentGraziano Visit my blog at