SlideShare a Scribd company logo
1 of 29
Introduction

                    Edwin Weber
                   Weber Solutions
                eacweber@gmail.com


Back end of Data Warehousing
MySQL, SQL Server, Oracle, PostgreSQL
PDI, SSIS, Oracle Warehouse Builder (long ago)
Project

Sint Antonius hospital in Utrecht, Open Source oriented
Chance to combine Kettle experience with a Data Vault
  (new to me)
Practically at the same time: project SSIS and Data Vault
So I jumped on the Data Vault bandwagon
Data Vault ETL


    Many objects to load, standardized procedures

    This screams for a generic solution

    I don't want to:
     
         manage too many Kettle objects
     
         connect similar columns in mappings by hand

    Solution:
     
         Generate Kettle objects?
     
         Or take it one step further, there's only 1 parameterised
         hub load object. Don't need to know xml structure of PDI
                                                                     3
         objects.
Goal


    Generic ETL to load a Data Vault

    Metadata driven

    No generation, 1 object for each Data Vault entity
    
        Hub
    
        Link
    
        Hub satellite
    
        Link satellite
    
        Define the mappings, create the Data Vault tables: done!
                                                                   4
Tools


    Ubuntu

    Pentaho Data Integration CE

    LibreOffice Calc

    MySQL 5.1

    Cookbook, doc generation by Roland Bouman

    (PostgreSQL 9.0, Oracle 11)



                                                5
Deliverables


    Set of PDI jobs and transformations

    Configuration files:
kettle.properties
shared.xml
repositories.xml

    Excel sheet that contains the specifications

    Scripts to generate/populate the pdi_meta and
    data_vault databases (or schemas)

                                                    6
Design decisions


    Updateable views with generic column names

    (MySQL more lenient than PostgreSQL)

    Compare satellite attributes via string comparison
    (concatenate all columns, with | (pipe) as delimiter)

    'inject' the metadata using Kettle parameters

    Generate and use an error table for each Data Vault
    table. Kettle handles the errors. Helps to find DV
    design conflicts, tables should contain few to none
    records in production.
                                                            7
Prerequisites


    Data Vault designed and implemented in database

    Staging tables and loading procedures in place
(can also be generic, we use PDI Metadata Injection step for loading files)

    Mapping from source to Data Vault specified
(now in an Excel sheet)




                                                                              8
Metadata tables




ref_data_vault_link_sources!




                               9
Design in LibreOffice (sources)




                                  10
Design in LibreOffice (hub + sat)




                                    11
Loading the metadata




                       12
'design errors'

Checks to avoid debugging:
(compares design metadata with Data Vault DB information_schema)



    hubs, links, satellites that don't exist in the DV

    key columns that do not exist in the DV

    missing connection data (source db)

    missing attribute columns


                                                                   13
A complete run




                 14
Spec: loading a hub

Load a hub, specified by:

    name

    key column

    business key column

    source table

    source table business key column
(can be expression, e.g. concatenate for composite key)


                                                          15
The Kettle objects: job hub




                              16
The Kettle objects: trf hub




                              17
Spec: loading a link

Load a link, specified by:

    name

    key column

    for each hub (maximum 10, can be a ref-table)
    
        hub name
    
        column name for the hub key in the link (roles!)
    
        column in the source table → business key of hub

    link 'attributes' (part of key, no hub, maximum 5)

    source table                                           18
The Kettle objects: job link




                               19
The Kettle objects: trf link

            Remove Unused 1 hub
            (peg-legged link)




                                  20
Spec: loading a hub satellite

Load a hub satellite, specified by:

    name

    key column

    hub name

    column in the source table → business key of hub

    for each attribute (maximum 200)
    
        source column
    
        target column

    source table                                       21
The Kettle objects: job hub sat




                                  22
The Kettle objects: trf hub sat




                                  23
Spec: loading a link satellite

Load a link satellite, specified by:

    name

    key column

    link name

    for each hub of the link:
column in the source table → business key of hub

    for each key attribute: source column

    for each attribute: source column → target column
                                                        24

    source table
Executing in a loop ..




                         25
.. and parallel




                  26
Logging

Default PDI logging enabled (e.g. errors)
N times 'generic job' is not so informative, so the jobs
  log:
   
       hub name
   
       link name
   
       hub satellite name
   
       link satellite name
   
       number of rows as start/end
   
       start/end time
                                                           27
Some points of interest


    Easy to make mistake in design sheet

    Generic → a bit harder to maintain and debug

    Application/tool to maintain metadata?

    Doc&%#$#@%tation (internals, checklists)




                                                   28
Availability of the code


    Free, because that's fair. I make a living with stuff that
    other people give away for free.

    Two flavours for now, MySQL and PostgreSQL.
    Oracle is 'under construction'.

    It's not on SourceForge, just mail me some Belgium
    beer and you get the code.




                                                             29

More Related Content

What's hot

Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyDbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyFranck Pachot
 
Spider HA 20100922(DTT#7)
Spider HA 20100922(DTT#7)Spider HA 20100922(DTT#7)
Spider HA 20100922(DTT#7)Kentoku
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
 
Accessing Databases from R
Accessing Databases from RAccessing Databases from R
Accessing Databases from RJeffrey Breen
 
Nov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars georgeNov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars georgeO'Reilly Media
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWJonathan Katz
 
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Altinity Ltd
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Josh Elser
 
Foreign Data Wrapper Enhancements
Foreign Data Wrapper EnhancementsForeign Data Wrapper Enhancements
Foreign Data Wrapper EnhancementsShigeru Hanada
 
Advanced Sharding Techniques with Spider (MUC2010)
Advanced Sharding Techniques with Spider (MUC2010)Advanced Sharding Techniques with Spider (MUC2010)
Advanced Sharding Techniques with Spider (MUC2010)Kentoku
 
Setup oracle golden gate 11g replication
Setup oracle golden gate 11g replicationSetup oracle golden gate 11g replication
Setup oracle golden gate 11g replicationKanwar Batra
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapperpsoo1978
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BIOracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BIFranck Pachot
 
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster PerformanceWebinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster PerformanceAltinity Ltd
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseAltinity Ltd
 
Power JSON with PostgreSQL
Power JSON with PostgreSQLPower JSON with PostgreSQL
Power JSON with PostgreSQLEDB
 

What's hot (20)

Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyDbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Spider HA 20100922(DTT#7)
Spider HA 20100922(DTT#7)Spider HA 20100922(DTT#7)
Spider HA 20100922(DTT#7)
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
Accessing Databases from R
Accessing Databases from RAccessing Databases from R
Accessing Databases from R
 
Nov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars georgeNov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars george
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDW
 
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20
 
Foreign Data Wrapper Enhancements
Foreign Data Wrapper EnhancementsForeign Data Wrapper Enhancements
Foreign Data Wrapper Enhancements
 
Advanced Sharding Techniques with Spider (MUC2010)
Advanced Sharding Techniques with Spider (MUC2010)Advanced Sharding Techniques with Spider (MUC2010)
Advanced Sharding Techniques with Spider (MUC2010)
 
Setup oracle golden gate 11g replication
Setup oracle golden gate 11g replicationSetup oracle golden gate 11g replication
Setup oracle golden gate 11g replication
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapper
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BIOracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
 
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster PerformanceWebinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
Power JSON with PostgreSQL
Power JSON with PostgreSQLPower JSON with PostgreSQL
Power JSON with PostgreSQL
 
Advanced Sqoop
Advanced Sqoop Advanced Sqoop
Advanced Sqoop
 

Viewers also liked

Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Huy Nguyen
 
SeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQLSeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQLParis, France
 
Chp2 - Les Entrepôts de Données
Chp2 - Les Entrepôts de DonnéesChp2 - Les Entrepôts de Données
Chp2 - Les Entrepôts de DonnéesLilia Sfaxi
 

Viewers also liked (6)

Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?
 
SeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQLSeSQL : un moteur de recherche en Python et PostgreSQL
SeSQL : un moteur de recherche en Python et PostgreSQL
 
Bi
BiBi
Bi
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Chp2 - Les Entrepôts de Données
Chp2 - Les Entrepôts de DonnéesChp2 - Les Entrepôts de Données
Chp2 - Les Entrepôts de Données
 

Similar to Data Vault ETL Kettle Jobs

Presentation pdi data_vault_framework_meetup2012
Presentation pdi data_vault_framework_meetup2012Presentation pdi data_vault_framework_meetup2012
Presentation pdi data_vault_framework_meetup2012Pentaho Community
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 
Alternator webinar september 2019
Alternator webinar   september 2019Alternator webinar   september 2019
Alternator webinar september 2019Nadav Har'El
 
A Dictionary Of Vb .Net
A Dictionary Of Vb .NetA Dictionary Of Vb .Net
A Dictionary Of Vb .NetLiquidHub
 
Implementing a build manager in Ada
Implementing a build manager in AdaImplementing a build manager in Ada
Implementing a build manager in AdaStephane Carrez
 
Obevo Javasig.pptx
Obevo Javasig.pptxObevo Javasig.pptx
Obevo Javasig.pptxLadduAnanu
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
android sqlite
android sqliteandroid sqlite
android sqliteDeepa Rani
 
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIIntroducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIScyllaDB
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineChester Chen
 
Using PowerShell to Simplify your ETL
Using PowerShell to Simplify your ETLUsing PowerShell to Simplify your ETL
Using PowerShell to Simplify your ETLScott Weinstein
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBLee Theobald
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
 

Similar to Data Vault ETL Kettle Jobs (20)

Presentation pdi data_vault_framework_meetup2012
Presentation pdi data_vault_framework_meetup2012Presentation pdi data_vault_framework_meetup2012
Presentation pdi data_vault_framework_meetup2012
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Alternator webinar september 2019
Alternator webinar   september 2019Alternator webinar   september 2019
Alternator webinar september 2019
 
Mach-O Internals
Mach-O InternalsMach-O Internals
Mach-O Internals
 
NoSql databases
NoSql databasesNoSql databases
NoSql databases
 
A Dictionary Of Vb .Net
A Dictionary Of Vb .NetA Dictionary Of Vb .Net
A Dictionary Of Vb .Net
 
Implementing a build manager in Ada
Implementing a build manager in AdaImplementing a build manager in Ada
Implementing a build manager in Ada
 
Obevo Javasig.pptx
Obevo Javasig.pptxObevo Javasig.pptx
Obevo Javasig.pptx
 
Linq to sql
Linq to sqlLinq to sql
Linq to sql
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
android sqlite
android sqliteandroid sqlite
android sqlite
 
lecture_34e.pptx
lecture_34e.pptxlecture_34e.pptx
lecture_34e.pptx
 
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIIntroducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Sbt
SbtSbt
Sbt
 
Using PowerShell to Simplify your ETL
Using PowerShell to Simplify your ETLUsing PowerShell to Simplify your ETL
Using PowerShell to Simplify your ETL
 
WebSphere Commerce v7 Data Load
WebSphere Commerce v7 Data LoadWebSphere Commerce v7 Data Load
WebSphere Commerce v7 Data Load
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

Data Vault ETL Kettle Jobs

  • 1. Introduction Edwin Weber Weber Solutions eacweber@gmail.com Back end of Data Warehousing MySQL, SQL Server, Oracle, PostgreSQL PDI, SSIS, Oracle Warehouse Builder (long ago)
  • 2. Project Sint Antonius hospital in Utrecht, Open Source oriented Chance to combine Kettle experience with a Data Vault (new to me) Practically at the same time: project SSIS and Data Vault So I jumped on the Data Vault bandwagon
  • 3. Data Vault ETL  Many objects to load, standardized procedures  This screams for a generic solution  I don't want to:  manage too many Kettle objects  connect similar columns in mappings by hand  Solution:  Generate Kettle objects?  Or take it one step further, there's only 1 parameterised hub load object. Don't need to know xml structure of PDI 3 objects.
  • 4. Goal  Generic ETL to load a Data Vault  Metadata driven  No generation, 1 object for each Data Vault entity  Hub  Link  Hub satellite  Link satellite  Define the mappings, create the Data Vault tables: done! 4
  • 5. Tools  Ubuntu  Pentaho Data Integration CE  LibreOffice Calc  MySQL 5.1  Cookbook, doc generation by Roland Bouman  (PostgreSQL 9.0, Oracle 11) 5
  • 6. Deliverables  Set of PDI jobs and transformations  Configuration files: kettle.properties shared.xml repositories.xml  Excel sheet that contains the specifications  Scripts to generate/populate the pdi_meta and data_vault databases (or schemas) 6
  • 7. Design decisions  Updateable views with generic column names  (MySQL more lenient than PostgreSQL)  Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)  'inject' the metadata using Kettle parameters  Generate and use an error table for each Data Vault table. Kettle handles the errors. Helps to find DV design conflicts, tables should contain few to none records in production. 7
  • 8. Prerequisites  Data Vault designed and implemented in database  Staging tables and loading procedures in place (can also be generic, we use PDI Metadata Injection step for loading files)  Mapping from source to Data Vault specified (now in an Excel sheet) 8
  • 10. Design in LibreOffice (sources) 10
  • 11. Design in LibreOffice (hub + sat) 11
  • 13. 'design errors' Checks to avoid debugging: (compares design metadata with Data Vault DB information_schema)  hubs, links, satellites that don't exist in the DV  key columns that do not exist in the DV  missing connection data (source db)  missing attribute columns 13
  • 15. Spec: loading a hub Load a hub, specified by:  name  key column  business key column  source table  source table business key column (can be expression, e.g. concatenate for composite key) 15
  • 16. The Kettle objects: job hub 16
  • 17. The Kettle objects: trf hub 17
  • 18. Spec: loading a link Load a link, specified by:  name  key column  for each hub (maximum 10, can be a ref-table)  hub name  column name for the hub key in the link (roles!)  column in the source table → business key of hub  link 'attributes' (part of key, no hub, maximum 5)  source table 18
  • 19. The Kettle objects: job link 19
  • 20. The Kettle objects: trf link Remove Unused 1 hub (peg-legged link) 20
  • 21. Spec: loading a hub satellite Load a hub satellite, specified by:  name  key column  hub name  column in the source table → business key of hub  for each attribute (maximum 200)  source column  target column  source table 21
  • 22. The Kettle objects: job hub sat 22
  • 23. The Kettle objects: trf hub sat 23
  • 24. Spec: loading a link satellite Load a link satellite, specified by:  name  key column  link name  for each hub of the link: column in the source table → business key of hub  for each key attribute: source column  for each attribute: source column → target column 24  source table
  • 25. Executing in a loop .. 25
  • 27. Logging Default PDI logging enabled (e.g. errors) N times 'generic job' is not so informative, so the jobs log:  hub name  link name  hub satellite name  link satellite name  number of rows as start/end  start/end time 27
  • 28. Some points of interest  Easy to make mistake in design sheet  Generic → a bit harder to maintain and debug  Application/tool to maintain metadata?  Doc&%#$#@%tation (internals, checklists) 28
  • 29. Availability of the code  Free, because that's fair. I make a living with stuff that other people give away for free.  Two flavours for now, MySQL and PostgreSQL. Oracle is 'under construction'.  It's not on SourceForge, just mail me some Belgium beer and you get the code. 29