Data Vault 2.0: Using MD5 Hashes for Change Data Capture
Kent Graziano 
Data Warrior LLC 
Twitter @KentGraziano
Data Vault Definition 
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.
Architected specifically to meet the needs of today’s enterprise data warehouses.
Source: Dan Linstedt, "Defining the Data Vault", TDAN.com article
Data Vault Time Line 
Mid 60’s – Dimension & Fact Modeling presented by General Mills and Dartmouth University
E.F. Codd invented relational modeling
Chris Date and Hugh Darwen maintained and refined modeling
Early 70’s – Bill Inmon began discussing Data Warehousing
Mid 70’s – AC Nielsen popularized Dimension & Fact terms
1976 – Dr Peter Chen created E-R Diagramming
Mid 80’s – Bill Inmon popularizes Data Warehousing
Mid – Late 80’s – Dr Kimball popularizes Star Schema
Late 80’s – Barry Devlin and Dr Kimball release “Business Data Warehouse”
1990 – Dan Linstedt begins R&D on Data Vault Modeling
2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling
(Timeline axis runs from 1960 to 2000.)
© LearnDataVault.com
2014 - Next Evolution
What’s New in DV2.0? 
 Modeling Structure Includes… 
● NoSQL, and Non-Relational DB systems, Hybrid Systems 
● Minor Structure Changes to support NoSQL 
 New ETL Implementation Standards 
● For true real-time support 
● For NoSQL support 
 New Architecture Standards 
● To include support for NoSQL data management systems 
 New Methodology Components 
● Including CMMI, Six Sigma, and TQM 
● Including Project Planning, Tracking, and Oversight 
● Agile Delivery Mechanisms 
● Standards, and templates for Projects 
Use of MD5 Hash in DV2.0
This model is fully compliant with Hadoop and needs NO changes to work properly.
The Hash Keys can be used to join to Hadoop data sets.
MD5 PK – replaces surrogate keys
MD5DIFF – used for change detection
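As a rough illustration of the MD5 PK idea, the Python sketch below derives a hash key from a business key. The function name `hash_key`, the sample key `CUST-1001`, and the trim and upper-case normalization are assumptions for illustration, not something this slide prescribes. Because the key is computed from the data itself rather than handed out by a sequence, two platforms can derive it independently and still join on it.

```python
import hashlib


def hash_key(*business_key_parts: str) -> str:
    """Return a 32-character upper-case hex MD5 of a normalized business key.

    Hypothetical helper: the trim/upper normalization and the '^' separator
    are assumptions chosen for this sketch.
    """
    normalized = "^".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()


# The same business key yields the same hash key wherever it is computed,
# so no surrogate-key lookup is needed before joining the two data sets.
key_in_warehouse = hash_key("CUST-1001")
key_in_hadoop = hash_key(" cust-1001 ")
```

Both calls produce the same 32-character value, which is what allows a relational table and a Hadoop data set to be joined on the hash key.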
MD5-based Change Detection 
 Think Type 2 SCD 
 Old Way: 
● Compare column by column 
● Source value != Current value in DW table 
● 20 columns, then 20 compares 
 New Way: 
● Concatenate all columns to one string 
● Convert to one char(32) string with hash function 
● Compare to hashed value (MD5DIFF) in target table 
● Does not matter how many columns 
© Data Warrior LLC
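The old and new ways can be contrasted in a short Python sketch. It is only an approximation of what a load process would do: the function names, the `^` separator, and the sample rows are hypothetical, and a real implementation would run inside the database or ETL tool.

```python
import hashlib


def md5_diff(row) -> str:
    """Concatenate all column values into one string and hash it to 32 hex characters."""
    joined = "^".join("" if value is None else str(value) for value in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest().upper()


def changed_column_by_column(source_row, current_row) -> bool:
    """Old way: one comparison per column."""
    return any(src != cur for src, cur in zip(source_row, current_row))


def changed_by_hash(source_row, stored_md5_diff: str) -> bool:
    """New way: one comparison, however many columns the row has."""
    return md5_diff(source_row) != stored_md5_diff


# Hypothetical rows: the current warehouse row and two incoming source rows.
current_row = ("ASPIRIN", 81, "MG", "TABLET")
stored_md5_diff = md5_diff(current_row)  # kept in the target table at load time

unchanged_source = ("ASPIRIN", 81, "MG", "TABLET")
changed_source = ("ASPIRIN", 325, "MG", "TABLET")
```

With the stored MD5DIFF available, the target row's individual columns never need to be read for the comparison.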
What does it look like? 
 Encode using standard MD5 hash function
● rawtohex(sys.utl_raw.cast_to_raw(dbms_obfuscation_toolkit.md5(input_string => ...)))
 Need to minimize chance of duplicates
● Without a separator, 12||3||45 and 1||2||345 both become 12345 and hash to the same value
● Need a separator between each column
● The separator also keeps null values in different positions distinct
● Example: Col1||'^'||Col2||'^'||Col3
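The duplicate problem is easy to reproduce. The sketch below is in Python rather than Oracle SQL, and the helper name `md5_hex` and the sample values are made up for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 of a string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Without a separator, different column values collapse into the same string.
no_sep_a = "12" + "3" + "45"    # "12345"
no_sep_b = "1" + "2" + "345"    # "12345"

# With a separator, the column boundaries survive in the hashed string.
with_sep_a = "^".join(["12", "3", "45"])    # "12^3^45"
with_sep_b = "^".join(["1", "2", "345"])    # "1^2^345"

# Nulls (treated as empty strings here) in different positions stay distinct.
null_second = "^".join(["A", "", "B"])    # "A^^B"
null_third = "^".join(["A", "B", ""])     # "A^B^"
```

One caveat: if the data itself can contain the separator character, collisions become possible again, so the separator should be a character that does not occur in the source columns.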
Other considerations 
 To generate most consistent string: standardize! 
 Convert data types 
 If 'NUMBER', 'NVARCHAR2', 'NVARCHAR', 'NCHAR'
● THEN 'TO_CHAR(' || column_name || ')'
 If 'RAW'
● THEN 'ENC_BASE64(' || column_name || ')'
 If 'DATE'
● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD'')'
 If LIKE 'TIME%'
● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD HH24:MI:SS'')'
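The same standardization idea, applied to values rather than column expressions, might look like the following Python sketch. The function name `to_canonical` is hypothetical, and `str()` is only a stand-in for Oracle's `TO_CHAR`: the two can format some numbers differently, so every platform that computes the hash would need to follow identical rules.

```python
import base64
from datetime import date, datetime
from decimal import Decimal


def to_canonical(value) -> str:
    """Convert a column value to one consistent string form before hashing."""
    if value is None:
        return ""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, date):
        return value.strftime("%Y-%m-%d")
    # Raw bytes are Base64-encoded so they become plain text.
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    # Numbers and character data fall through to a plain string conversion.
    return str(value)


# Hypothetical sample values, one per rule.
as_date = to_canonical(date(2014, 3, 5))
as_timestamp = to_canonical(datetime(2014, 3, 5, 13, 7, 9))
as_raw = to_canonical(b"\x00\x01")
as_number = to_canonical(Decimal("81.5"))
```

Fixing the format explicitly matters because a default date or number format that differs between sessions or platforms would change the string, and therefore the hash, without any change in the data.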
Final Input String 
(UPPER(TRIM(T1.GENERICNAME)) 
||'^'|| 
UPPER(TRIM(TO_CHAR(T1.MED_STRNG_AMT)))
||'^'|| 
UPPER(TRIM(T1.UOM_CD)) 
||'^'|| 
UPPER(TRIM(T1.MED_FORM_NM)) 
||'^') 
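To see what this expression produces for one row, here is a Python approximation. The function name `build_input_string` and the sample values are hypothetical. Nulls are mapped to empty strings because Oracle concatenation treats NULL as an empty string, and `strip(" ")` mimics the default behavior of `TRIM`, which removes only spaces.

```python
import hashlib


def build_input_string(generic_name, med_strng_amt, uom_cd, med_form_nm) -> str:
    """Mirror UPPER(TRIM(col)) || '^' for each column, in the same order."""
    columns = (generic_name, med_strng_amt, uom_cd, med_form_nm)
    normalized = (
        "" if value is None else str(value).strip(" ").upper()
        for value in columns
    )
    # Each column is followed by the separator, including the last one.
    return "".join(text + "^" for text in normalized)


# Hypothetical source row with stray spaces, mixed case, and a null.
input_string = build_input_string(" aspirin ", 81, "mg", None)

# The MD5DIFF value stored in the target table: 32 upper-case hex characters.
md5_diff_value = hashlib.md5(input_string.encode("utf-8")).hexdigest().upper()
```

Upper-casing and trimming mean that differences in case or padding alone do not register as changes, which is a design choice: a source system that treats case as significant would need a different rule.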
So what? 
 MD5 hash is consistent cross-platform 
 Reduces a multi-column compare to a single-column compare
 Every compare costs the same during the load process, regardless of how many columns the row has
 Can be used with any DW architecture that requires change detection
 Virtually no limit 
● Think Big Data/Hadoop/NoSQL 
 Can generate the input string automatically 
● But that is another talk! 
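Although the talk leaves automatic generation for another session, the data type rules and the final input string shown earlier suggest roughly how it could work. The Python sketch below is a guess at the approach, not the author's generator: the function names and the column types are hypothetical, `ENC_BASE64` is reproduced from the slide and assumed to be a user-defined function, and wrapping every column in `UPPER(TRIM(...))` is an assumption for the types the slides do not show.

```python
def column_expression(alias: str, column_name: str, data_type: str) -> str:
    """Build the SQL expression text for one column from its data type."""
    qualified = f"{alias}.{column_name}"
    if data_type in ("NUMBER", "NVARCHAR2", "NVARCHAR", "NCHAR"):
        converted = f"TO_CHAR({qualified})"
    elif data_type == "RAW":
        converted = f"ENC_BASE64({qualified})"
    elif data_type == "DATE":
        converted = f"TO_CHAR({qualified}, 'YYYY-MM-DD')"
    elif data_type.startswith("TIME"):
        converted = f"TO_CHAR({qualified}, 'YYYY-MM-DD HH24:MI:SS')"
    else:
        converted = qualified
    return f"UPPER(TRIM({converted}))"


def input_string_expression(alias: str, columns) -> str:
    """Join the per-column expressions with the '^' separator."""
    parts = [column_expression(alias, name, dtype) for name, dtype in columns]
    return "(" + "||'^'||".join(parts) + "||'^')"


# Hypothetical column metadata; in practice it would come from the data dictionary.
columns = [
    ("GENERICNAME", "VARCHAR2"),
    ("MED_STRNG_AMT", "NUMBER"),
    ("UOM_CD", "VARCHAR2"),
    ("MED_FORM_NM", "VARCHAR2"),
]
generated = input_string_expression("T1", columns)
```

For these four columns the generated text matches the final input string shown earlier, apart from line breaks.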
Learn more about Data Vault 
www.LearnDataVault.com 
www.danlinstedt.com 
On YouTube: 
www.youtube.com/LearnDataVault 
On Facebook: 
www.facebook.com/learndatavault
Super Charge Your Data Warehouse 
Available on Amazon.com 
Soft Cover or Kindle Format 
Now also available in PDF at 
LearnDataVault.com
Contact Information 
Kent Graziano 
The Oracle Data Warrior 
Data Warrior LLC 
Kent.graziano@att.net 
On Twitter @KentGraziano 
Visit my blog at 
http://kentgraziano.com
