2. Data Vault Definition
The Data Vault is a detail oriented, historical tracking and uniquely
linked set of normalized tables that support one or more functional
areas of business. It is a hybrid approach encompassing the best
of breed between 3rd normal form (3NF) and star schema. The
design is flexible, scalable, consistent and adaptable to the needs
of the enterprise. It is a data model that is architected specifically
to meet the needs of enterprise data warehouses.
Source: Dan Linstedt
http://www.tdan.com/view-articles/5054/
3. Data Vault Building Blocks
different sources/rate of change
Source: Dan Linstedt
http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012
4. Data Vault Fundamentals: Hub
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
5. Data Vault Fundamentals: Link
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
6. Data Vault Fundamentals: Satellite
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
7. Data Vault Fundamentals: Model
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
8. Data Vault ETL
Many objects to load, standardized procedures
This screams for a generic solution!
I don't want to:
throw ETL tool away and code it all myself
manage too many ETL objects
connect similar columns in mappings by hand
I do want to:
generate ETL (Kettle) objects? No.
Take it one step further: there is only one parameterised hub load
object, so there is no need to know the XML structure of PDI objects.
9. Tools
Operating System
Database
Virtualization
Data Integration 'Productivity'
SQL Development
Version Control
10. Place of framework in architecture
[Figure: architecture diagram. Sources (files, CSV files, DBMS, ERP) → ETL (Kettle) → Staging Area (MySQL) → ETL (Data Vault Framework) → Central DWH & Data Marts (Data Vault) → ETL → EUL. Layers: Sources, ETL Process, Data Warehouse, EUL.]
11. What has to be taken care of?
Data Vault designed and implemented in database
Staging tables and loading procedures in place
(can also be generic; we use the PDI Metadata Injection step for loading files)
Mapping from source to Data Vault specified
(now in an Excel sheet)
12. Framework components
PDI repository (file based), jobs and transformations
Configuration files:
kettle.properties
shared.xml
repositories.xml
Excel sheet that contains the specifications
MySQL database for metadata
Virtual machine with Ubuntu 12.04 Server
13. Design decisions
Updateable views with generic column names
(MySQL more lenient than PostgreSQL)
Compare satellite attributes via string comparison
(concatenate all columns, with | (pipe) as delimiter)
'Inject' the metadata using Kettle parameters
Generate and use an error table for each Data Vault table
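The string comparison of satellite attributes can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the framework's Kettle implementation; the names `row_signature` and `has_changed`, and the choice to render NULL as an empty string, are assumptions.

```python
from typing import Mapping, Optional, Sequence

DELIMITER = "|"  # pipe delimiter, as named in the design decision


def row_signature(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Concatenate the attribute values in a fixed column order."""
    # Assumption: NULL (None) or a missing column is rendered as an empty string.
    return DELIMITER.join(
        "" if row.get(c) is None else str(row[c]) for c in columns
    )


def has_changed(
    source_row: Mapping[str, object],
    current_sat_row: Optional[Mapping[str, object]],
    columns: Sequence[str],
) -> bool:
    """Return True if a new satellite record should be written."""
    if current_sat_row is None:  # no satellite record yet for this key
        return True
    return row_signature(source_row, columns) != row_signature(
        current_sat_row, columns
    )


columns = ["name", "city"]  # hypothetical satellite attributes
current = {"name": "Acme", "city": "Utrecht"}
incoming = {"name": "Acme", "city": "Amsterdam"}
print(has_changed(incoming, current, columns))
```

One caveat of this approach: values that themselves contain the delimiter can produce equal strings for different rows (for example `"a|"`, `"b"` versus `"a"`, `"|b"`), so the delimiter should be a character that is unlikely to occur in the data.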
22. 'design errors'
Checks to avoid debugging:
(compares design metadata with Data Vault DB information_schema)
hubs, links, satellites that don't exist in the DV
key columns that do not exist in the DV
missing connection data (source db)
missing attribute columns
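A check of this kind could look like the sketch below. It assumes the design metadata and the database catalog have both been read into dictionaries that map a table name to its set of column names; the catalog side would typically come from a query on `information_schema.columns`. The function name, the message texts and the example tables are hypothetical.

```python
from typing import Dict, List, Set


def find_design_errors(
    design: Dict[str, Set[str]],
    catalog: Dict[str, Set[str]],
) -> List[str]:
    """Compare designed tables and columns with what exists in the database.

    design:  table name -> columns named in the design metadata
    catalog: table name -> columns found in information_schema.columns
    """
    errors = []
    for table, columns in sorted(design.items()):
        if table not in catalog:
            errors.append(f"table {table} does not exist in the DV")
            continue  # no point in checking columns of a missing table
        for column in sorted(columns - catalog[table]):
            errors.append(f"column {table}.{column} does not exist in the DV")
    return errors


# Hypothetical design sheet content and catalog content.
design = {
    "hub_customer": {"customer_key", "customer_bk"},
    "sat_customer": {"customer_key", "name"},
}
catalog = {"hub_customer": {"customer_key", "customer_bk", "load_dts"}}

for message in find_design_errors(design, catalog):
    print(message)
```

Running all checks before the load starts turns a failing transformation into a readable list of design mistakes.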
24. Metadata needed for a hub
name
key column
business key column
source table
source table business key column
(can be expression, e.g. concatenate for composite key)
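To show that these five values are enough to drive a generic hub load, here is a minimal Python sketch that turns them into a single SQL statement. The framework itself passes the values as Kettle parameters to one hub load transformation; generating SQL is a substitute technique used here only for illustration. The class and function names, the example tables, and the assumption that the hub key is auto-generated are all hypothetical, and load date and record source columns are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSpec:
    name: str                 # hub table name
    key_column: str           # key column of the hub (assumed auto-generated)
    business_key_column: str  # business key column in the hub
    source_table: str         # source (staging) table
    source_business_key: str  # source column, or an expression for a composite key


def hub_load_sql(hub: HubSpec) -> str:
    """Build an insert for business keys that are not yet present in the hub."""
    return (
        f"INSERT INTO {hub.name} ({hub.business_key_column}) "
        f"SELECT DISTINCT src.bk FROM "
        f"(SELECT {hub.source_business_key} AS bk FROM {hub.source_table}) src "
        f"LEFT JOIN {hub.name} h ON h.{hub.business_key_column} = src.bk "
        f"WHERE h.{hub.key_column} IS NULL AND src.bk IS NOT NULL"
    )


customer_hub = HubSpec(
    name="hub_customer",
    key_column="customer_key",
    business_key_column="customer_bk",
    source_table="stg_customer",
    # Composite business key expressed as a concatenation.
    source_business_key="CONCAT(country, '|', customer_no)",
)
print(hub_load_sql(customer_hub))
```

Adding another hub then means adding one more `HubSpec` row of metadata, not another ETL object.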
27. Metadata needed for a link
name
key column
for each hub (maximum 10, can be a ref-table)
hub name
column name for the hub key in the link (roles!)
column in the source table → business key of hub
link 'attributes' (part of key, no hub, maximum 5)
link validity satellite needed?
last seen date needed?
source table
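The link metadata can be pictured as a small data structure. The sketch below is an assumption about how one might hold it in Python, with the two limits from the list above checked in `validate`; all class, field and table names are hypothetical. The example uses the same hub twice under different roles, which is why the column name for the hub key is part of the metadata.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HUBS = 10       # maximum number of hubs per link, as stated above
MAX_ATTRIBUTES = 5  # maximum number of link 'attributes', as stated above


@dataclass
class HubReference:
    hub_name: str
    link_column: str    # column name for the hub key in the link (the role)
    source_column: str  # source column holding the business key of the hub


@dataclass
class LinkSpec:
    name: str
    key_column: str
    source_table: str
    hubs: List[HubReference]
    attributes: List[str] = field(default_factory=list)  # part of key, no hub
    validity_satellite: bool = False
    last_seen_date: bool = False

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if len(self.hubs) > MAX_HUBS:
            problems.append(f"{self.name}: more than {MAX_HUBS} hubs")
        if len(self.attributes) > MAX_ATTRIBUTES:
            problems.append(f"{self.name}: more than {MAX_ATTRIBUTES} link attributes")
        roles = [h.link_column for h in self.hubs]
        for column in sorted(set(roles)):
            if roles.count(column) > 1:
                problems.append(f"{self.name}: hub key column {column} used twice")
        return problems


reports_to = LinkSpec(
    name="link_reports_to",
    key_column="reports_to_key",
    source_table="stg_employee",
    hubs=[
        HubReference("hub_employee", "employee_key", "emp_no"),
        HubReference("hub_employee", "manager_key", "mgr_no"),
    ],
    last_seen_date=True,
)
print(reports_to.validate())
```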
29. Transformation for link
Last seen?
Lookup hubs
Remove columns not in link
Run table needed for validity sat?
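The two core steps named on this slide, looking up the hubs and removing columns that are not in the link, can be sketched in memory as below. This is an illustration in Python, not the Kettle transformation itself; the function name, the shape of the lookup dictionaries and the example data are assumptions, and the last seen date and validity satellite handling are left out.

```python
from typing import Dict, List, Mapping, Tuple

# (source column, hub key column in the link, hub name)
HubColumn = Tuple[str, str, str]


def build_link_row(
    source_row: Mapping[str, object],
    hub_lookups: Mapping[str, Mapping[object, int]],
    hub_columns: List[HubColumn],
) -> Dict[str, int]:
    """Replace business keys by hub keys and keep only the link columns."""
    link_row = {}
    for source_column, link_column, hub_name in hub_columns:
        business_key = source_row[source_column]
        # Lookup hubs: business key -> hub key.
        link_row[link_column] = hub_lookups[hub_name][business_key]
    # Only link columns were copied, so all other source columns are dropped.
    return link_row


hub_lookups = {"hub_employee": {"E1": 1, "E2": 2}}  # hypothetical hub content
source_row = {"emp_no": "E2", "mgr_no": "E1", "salary": 100}
hub_columns = [
    ("emp_no", "employee_key", "hub_employee"),
    ("mgr_no", "manager_key", "hub_employee"),
]
print(build_link_row(source_row, hub_lookups, hub_columns))
```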
30. Metadata needed for a hub satellite
name
key column
hub name
column in the source table → business key of hub
for each attribute (maximum 200)
source column
target column
source table
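The source column to target column pairs are what replaces connecting columns by hand in a mapping. As a rough sketch, assuming the pairs are held in an ordered dictionary, they can be turned into a select list as shown below. The function name, the `business_key` alias and the example columns are hypothetical; the framework itself works with Kettle parameters and views with generic column names, not with this function.

```python
from typing import Dict

MAX_ATTRIBUTES = 200  # maximum number of attributes, as stated above


def satellite_select_sql(
    source_table: str,
    business_key_column: str,
    attributes: Dict[str, str],
) -> str:
    """Build a SELECT that renames source columns to satellite columns.

    attributes: source column -> target column, in the desired column order
    """
    if len(attributes) > MAX_ATTRIBUTES:
        raise ValueError(f"more than {MAX_ATTRIBUTES} attributes")
    columns = [f"{business_key_column} AS business_key"]
    columns += [f"{source} AS {target}" for source, target in attributes.items()]
    return f"SELECT {', '.join(columns)} FROM {source_table}"


print(
    satellite_select_sql(
        "stg_customer",
        "cust_no",
        {"cust_name": "name", "cust_city": "city"},
    )
)
```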
33. Metadata needed for a link satellite
name
key column
link name
for each hub of the link:
column in the source table → business key of hub
for each key attribute: source column
for each attribute: source column → target column
source table
41. Some points of interest
Easy to make mistakes in the design sheet
Generic → a bit harder to maintain and debug
Application/tool to maintain metadata?
Data Vault generators (e.g. Quipu)?
Spinoff using Informatica and Oracle: Sander Robijns
Thanks to: Jos van Dongen
Kasper de Graaf