The document describes the Data Vault modeling technique, which stores historical data from different sources in a series of normalized tables. It outlines the key components of a Data Vault, including hubs, links, and satellites, and then discusses how to implement a Data Vault with the Pentaho Data Integration (Kettle) tool: generating and loading metadata, and executing jobs and transformations in parallel via a framework that standardizes the ETL process.
With the permission of Edwin Weber, a presentation on my 'Kettle Data Vault Framework' and a virtual machine with Ubuntu and MySQL are available from his LinkedIn 'Company Website'.
2. Data Vault Definition
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.
Source: Dan Linstedt
http://www.tdan.com/view-articles/5054/
3. Data Vault Building Blocks
different sources/rate of change
Source: Dan Linstedt
http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012
4. Data Vault Fundamentals: Hub
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
5. Data Vault Fundamentals: Link
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
6. Data Vault Fundamentals: Satellite
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
7. Data Vault Fundamentals: Model
Source: data-vault-modeling-guide
GENESEE ACADEMY, LLC, Hans Hultgren
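As a rough illustration of the three building blocks above (not taken from the slides; all table and column names are invented for the example), a customer hub, an order-customer link and a customer satellite might be created in MySQL roughly like this:

  -- Hypothetical hub: one row per distinct business key
  CREATE TABLE hub_customer (
    hub_customer_key INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
    customer_bk      VARCHAR(50) NOT NULL UNIQUE,     -- business key
    load_dts         DATETIME NOT NULL,
    record_source    VARCHAR(50) NOT NULL
  );

  -- Hypothetical link: relates hubs by their surrogate keys
  CREATE TABLE link_order_customer (
    link_order_customer_key INT AUTO_INCREMENT PRIMARY KEY,
    hub_order_key           INT NOT NULL,
    hub_customer_key        INT NOT NULL,
    load_dts                DATETIME NOT NULL,
    record_source           VARCHAR(50) NOT NULL,
    UNIQUE (hub_order_key, hub_customer_key)
  );

  -- Hypothetical satellite: historised descriptive attributes of the hub
  CREATE TABLE sat_customer (
    hub_customer_key INT NOT NULL,
    load_dts         DATETIME NOT NULL,
    name             VARCHAR(100),
    city             VARCHAR(100),
    record_source    VARCHAR(50) NOT NULL,
    PRIMARY KEY (hub_customer_key, load_dts)
  );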
8. Data Vault ETL
Many objects to load, standardized procedures
This screams for a generic solution!
I don't want to:
throw the ETL tool away and code it all myself
manage too many ETL objects
connect similar columns in mappings by hand
I do want to:
generate ETL (Kettle) objects? No: take it one step further. There is only one parameterised hub-load object, so there is no need to know the XML structure of the PDI objects.
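A minimal sketch of what that single parameterised hub-load object boils down to in SQL terms; the ${...} placeholders stand for Kettle parameters injected from the design metadata, and all names are assumptions rather than the framework's actual ones:

  -- Sketch: insert business keys that are not yet in the hub
  -- ${...} are assumed Kettle parameters filled in from the design metadata
  INSERT INTO ${HUB_TABLE} (${BUSINESS_KEY_COLUMN}, load_dts, record_source)
  SELECT DISTINCT s.${SOURCE_BK_COLUMN}, NOW(), '${RECORD_SOURCE}'
  FROM ${STAGING_TABLE} s
  LEFT JOIN ${HUB_TABLE} h ON h.${BUSINESS_KEY_COLUMN} = s.${SOURCE_BK_COLUMN}
  WHERE h.${BUSINESS_KEY_COLUMN} IS NULL;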
9. Tools
Tool categories: Operating System, Database, Virtualization, Data Integration, 'Productivity', SQL Development, Version Control
10. Place of framework in architecture
[Architecture diagram: sources (DBMS, CSV files, ERP) are loaded by Kettle ETL into a MySQL staging area; the ETL Data Vault framework loads the MySQL Data Vault, and further ETL feeds the central DWH & data marts that serve the end-user layer (EUL).]
11. What has to be taken care of?
Data Vault designed and implemented in database
Staging tables and loading procedures in place
(can also be generic; we use the PDI Metadata Injection step for loading files)
Mapping from source to Data Vault specified
(now in an Excel sheet)
12. Framework components
PDI repository (file based), jobs and transformations
Configuration files:
kettle.properties
shared.xml
repositories.xml
Excel sheet that contains the specifications
MySQL database for metadata
Virtual machine with Ubuntu 12.04 Server
13. Design decisions
Updateable views with generic column names
(MySQL more lenient than PostgreSQL)
Compare satellite attributes via string comparison
(concatenate all columns, with | (pipe) as delimiter)
'inject' the metadata using Kettle parameters
Generate and use an error table for each Data Vault table
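A sketch of the first design decision, reusing the invented hub from earlier: a view that exposes the hub's real columns under generic names, so one parameterised transformation can insert through the same view definition for every hub:

  -- Sketch: updateable view with generic column names over one concrete hub
  CREATE OR REPLACE VIEW v_hub_load AS
  SELECT hub_customer_key AS hub_key,
         customer_bk      AS business_key,
         load_dts,
         record_source
  FROM hub_customer;

  -- The generic load step can now always address the same column names:
  -- INSERT INTO v_hub_load (business_key, load_dts, record_source) VALUES (...);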
22. 'design errors'
Checks to avoid debugging:
(compares design metadata with Data Vault DB information_schema)
hubs, links, satellites that don't exist in the DV
key columns that do not exist in the DV
missing connection data (source db)
missing attribute columns
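One such check could be expressed roughly as follows, assuming a hypothetical meta_hub design table and a 'datavault' schema (names are illustrative, not the framework's actual ones):

  -- Sketch: hubs specified in the design metadata but missing in the Data Vault DB
  SELECT m.hub_name
  FROM meta_hub m
  LEFT JOIN information_schema.tables t
         ON t.table_schema = 'datavault'
        AND t.table_name   = m.hub_name
  WHERE t.table_name IS NULL;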
24. Metadata needed for a hub
name
key column
business key column
source table
source table business key column
(can be expression, e.g. concatenate for composite key)
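The hub specification above could map onto a metadata table like the hypothetical meta_hub used in the check earlier; the real framework keeps the specification in an Excel sheet and a MySQL metadata database, and this layout is only an assumption:

  -- Sketch of a hub specification record (invented layout)
  CREATE TABLE meta_hub (
    hub_name         VARCHAR(64),   -- name
    key_column       VARCHAR(64),   -- hub key column
    bk_column        VARCHAR(64),   -- business key column
    source_table     VARCHAR(64),   -- staging table to read from
    source_bk_column VARCHAR(255)   -- column or expression, e.g. CONCAT(col_a, '|', col_b)
  );

  INSERT INTO meta_hub VALUES
    ('hub_customer', 'hub_customer_key', 'customer_bk', 'stg_customers', 'customer_number');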
27. Metadata needed for a link
name
key column
for each hub (maximum 10, can be a ref-table)
hub name
column name for the hub key in the link (roles!)
column in the source table → business key of hub
link 'attributes' (part of key, no hub, maximum 5)
link validity satellite needed?
last seen date needed?
source table
29. Transformation for link
Last seen?
Lookup hubs
Remove columns not in link
Run table needed for validity sat?
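In SQL terms (the framework actually does this in a Kettle transformation), the hub lookups and the insert of previously unseen link rows could be sketched like this, reusing the invented names from earlier and adding a hypothetical hub_order and staging table:

  -- Sketch: look up both hub keys, then insert link rows not seen before
  INSERT INTO link_order_customer (hub_order_key, hub_customer_key, load_dts, record_source)
  SELECT ho.hub_order_key, hc.hub_customer_key, NOW(), 'erp'
  FROM stg_order_lines s
  JOIN hub_order    ho ON ho.order_bk    = s.order_number      -- lookup hub 1
  JOIN hub_customer hc ON hc.customer_bk = s.customer_number   -- lookup hub 2
  LEFT JOIN link_order_customer l
         ON l.hub_order_key    = ho.hub_order_key
        AND l.hub_customer_key = hc.hub_customer_key
  WHERE l.link_order_customer_key IS NULL;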
30. Metadata needed for a hub satellite
name
key column
hub name
column in the source table → business key of hub
for each attribute (maximum 200)
source column
target column
source table
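Tying this to the pipe-delimited comparison from slide 13, a hub-satellite load could be sketched as follows: resolve the hub key, concatenate the attributes with '|' and insert a new row only when the concatenation differs from the most recent satellite row (all names invented):

  -- Sketch: insert a new satellite row only when the pipe-delimited attribute
  -- string differs from the current (most recent) satellite row
  INSERT INTO sat_customer (hub_customer_key, load_dts, name, city, record_source)
  SELECT h.hub_customer_key, NOW(), s.cust_name, s.cust_city, 'erp'
  FROM stg_customers s
  JOIN hub_customer h ON h.customer_bk = s.customer_number
  LEFT JOIN sat_customer cur
         ON cur.hub_customer_key = h.hub_customer_key
        AND cur.load_dts = (SELECT MAX(load_dts)
                            FROM sat_customer
                            WHERE hub_customer_key = h.hub_customer_key)
  WHERE cur.hub_customer_key IS NULL
     OR CONCAT_WS('|', cur.name, cur.city) <> CONCAT_WS('|', s.cust_name, s.cust_city);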
33. Metadata needed for a link satellite
name
key column
link name
for each hub of the link:
column in the source table → business key of hub
for each key attribute: source column
for each attribute: source column → target column
source table
41. Some points of interest
Easy to make a mistake in the design sheet
Generic → a bit harder to maintain and debug
Application/tool to maintain metadata?
Data Vault generators (e.g. Quipu)?
Spinoff using Informatica and Oracle: Sander Robijns
Thanks to: Jos van Dongen
Kasper de Graaf