Presentation pdi data_vault_framework_meetup2012

Edwin Weber's PDI Data Vault framework

Published in: Technology

  1. Introduction: neacweber@gmail.com
  2. Data Vault Definition: "The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses." Source: Dan Linstedt, http://www.tdan.com/view-articles/5054/
  3. Data Vault Building Blocks: different sources/rate of change. Source: Dan Linstedt, http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012
  4. Data Vault Fundamentals: Hub. Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren
  5. Data Vault Fundamentals: Link. Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren
  6. Data Vault Fundamentals: Satellite. Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren
  7. Data Vault Fundamentals: Model. Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren
  8. Data Vault ETL: Many objects to load, standardized procedures. This screams for a generic solution!
     I don't want to:
     • throw the ETL tool away and code it all myself
     • manage too many ETL objects
     • connect similar columns in mappings by hand
     I do want to:
     • generate ETL (Kettle) objects? No. Take it one step further: there's only 1 parameterised hub load object. No need to know the XML structure of PDI objects.
  9. Tools: Operating System, Database, Virtualization, Data Integration, Productivity, SQL Development, Version Control
  10. Place of framework in architecture (diagram). Labels in the diagram: Sources (Files, CSV Files, MySQL, ERP), ETL Process, ETL: Kettle, Data Vault Framework, MySQL DBMS, Data Vault, Staging Area, Central DWH & Data Marts, Data Warehouse, EUL
  11. What has to be taken care of?
     • Data Vault designed and implemented in the database
     • Staging tables and loading procedures in place (can also be generic; we use the PDI Metadata Injection step for loading files)
     • Mapping from source to Data Vault specified (now in an Excel sheet)
  12. Framework components
     • PDI repository (file based), jobs and transformations
     • Configuration files: kettle.properties, shared.xml, repositories.xml
     • Excel sheet that contains the specifications
     • MySQL database for metadata
     • Virtual machine with Ubuntu 12.04 Server
  13. Design decisions
     • Updateable views with generic column names (MySQL more lenient than PostgreSQL)
     • Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)
     • Inject the metadata using Kettle parameters
     • Generate and use an error table for each Data Vault table
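The string comparison named in the design decisions can be sketched in a few lines. This is a minimal Python illustration, not the framework's actual Kettle implementation; the function names, the sample columns and the choice to render NULL as an empty string are assumptions made for the example.

```python
from typing import Mapping, Optional, Sequence

DELIMITER = "|"  # the slide names the pipe character as delimiter


def attribute_string(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Concatenate the attribute values of one row in a fixed column order."""
    # Assumption: NULL (None) is rendered as an empty string.
    return DELIMITER.join("" if row[c] is None else str(row[c]) for c in columns)


def satellite_changed(
    staged: Mapping[str, object],
    current: Optional[Mapping[str, object]],
    columns: Sequence[str],
) -> bool:
    """Return True when a new satellite row should be written."""
    if current is None:  # no satellite row exists yet for this key
        return True
    return attribute_string(staged, columns) != attribute_string(current, columns)


# Hypothetical sample data.
columns = ["name", "city"]
current = {"name": "Acme", "city": "Utrecht"}
print(satellite_changed({"name": "Acme", "city": "Utrecht"}, current, columns))
print(satellite_changed({"name": "Acme", "city": "Delft"}, current, columns))
```

One consequence of comparing concatenated strings: values that themselves contain the delimiter can collide, for example ("a|b", "c") and ("a", "b|c") both become "a|b|c", so the delimiter should be a character that does not occur in the data.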
  14. Metadata tables: all have history tables
  15. Metadata in Excel: Data Vault, connections, source systems, source tables
  16. Metadata in Excel (hub + sat): x 200 (max)
  17. Metadata in Excel (link): x 10, link attributes
  18. Metadata in Excel (link satellite): x 10, x 5, x 200 (max)
  19. Last seen date: applicable for hubs and links. Existing hubs and links: update last_seen_dts!
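The last seen rule (insert new keys, update `last_seen_dts` on existing ones) can be sketched as follows. This is an illustration using Python's built-in `sqlite3` module rather than the MySQL and Kettle setup of the framework; the table `hub_customer` and the columns `customer_bk` and `load_dts` are hypothetical, only `last_seen_dts` comes from the slide.

```python
import sqlite3


def load_hub(conn: sqlite3.Connection, business_keys, load_dts: str) -> None:
    """Insert unseen business keys; refresh last_seen_dts on existing ones."""
    for key in business_keys:
        cur = conn.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_bk = ?",
            (load_dts, key),
        )
        if cur.rowcount == 0:  # key not in the hub yet
            conn.execute(
                "INSERT INTO hub_customer (customer_bk, load_dts, last_seen_dts) "
                "VALUES (?, ?, ?)",
                (key, load_dts, load_dts),
            )
    conn.commit()


# Hypothetical hub table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    "customer_id INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE, "
    "load_dts TEXT, last_seen_dts TEXT)"
)
load_hub(conn, ["C1", "C2"], "2012-01-01")
load_hub(conn, ["C2", "C3"], "2012-01-02")
rows = conn.execute(
    "SELECT customer_bk, load_dts, last_seen_dts FROM hub_customer ORDER BY customer_bk"
).fetchall()
print(rows)
```

After the second run, C1 keeps its original last seen date, C2 keeps its load date but gets the newer last seen date, and C3 is inserted as new.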
  20. Link validity satellite. Link has business key: not all hub ids
  21. Loading the metadata
  22. Design errors. Checks to avoid debugging (compares design metadata with Data Vault DB information_schema):
     • hubs, links, satellites that don't exist in the DV
     • key columns that do not exist in the DV
     • missing connection data (source db)
     • missing attribute columns
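Two of these checks (missing tables and missing columns) can be sketched as a set comparison. This is a hedged Python illustration, not the framework's code: the function name, the sample metadata and the error wording are assumptions. The `schema_columns` input stands for rows such as those returned by `SELECT table_name, column_name FROM information_schema.columns` for the Data Vault schema.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_design_errors(
    designed: Dict[str, Set[str]],
    schema_columns: Iterable[Tuple[str, str]],
) -> List[str]:
    """Compare designed tables/columns with what the database reports."""
    actual: Dict[str, Set[str]] = {}
    for table, column in schema_columns:
        actual.setdefault(table, set()).add(column)

    errors: List[str] = []
    for table in sorted(designed):
        if table not in actual:
            errors.append(f"table {table} does not exist in the Data Vault")
            continue
        for column in sorted(designed[table] - actual[table]):
            errors.append(f"column {table}.{column} does not exist in the Data Vault")
    return errors


# Hypothetical design metadata and information_schema rows.
designed = {
    "hub_customer": {"customer_id", "customer_bk"},
    "sat_customer": {"customer_id", "name"},
    "link_order": {"order_id"},
}
schema_columns = [
    ("hub_customer", "customer_id"),
    ("hub_customer", "customer_bk"),
    ("sat_customer", "customer_id"),
]
errors = find_design_errors(designed, schema_columns)
for message in errors:
    print(message)
```

Running such checks before the load reports a mistyped table or column name in the design sheet as a readable message instead of a failing transformation.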
  23. A complete run
  24. Metadata needed for a hub
     • name
     • key column
     • business key column
     • source table
     • source table business key column (can be an expression, e.g. concatenate for composite key)
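To show how this metadata can drive a single parameterised hub load, here is a minimal Python sketch that builds one generic INSERT statement from a hub specification. It is an assumption-laden illustration, not the PDI transformation itself: the `HubSpec` class, `hub_insert_sql`, the table names and the SQLite `||` concatenation are all hypothetical choices (MySQL would use `CONCAT`).

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSpec:
    name: str                  # hub table
    key_column: str            # surrogate key, generated by the database here
    business_key_column: str   # business key column in the hub
    source_table: str          # staging table
    source_business_key: str   # column or expression, qualified with alias "s"


def hub_insert_sql(spec: HubSpec) -> str:
    """Build an INSERT that adds business keys not yet present in the hub."""
    # Identifiers are interpolated, so the metadata must come from a trusted source.
    bk = f"({spec.source_business_key})"
    return (
        f"INSERT INTO {spec.name} ({spec.business_key_column}) "
        f"SELECT DISTINCT {bk} FROM {spec.source_table} s "
        f"WHERE {bk} IS NOT NULL AND NOT EXISTS ("
        f"SELECT 1 FROM {spec.name} h WHERE h.{spec.business_key_column} = {bk})"
    )


# Hypothetical staging and hub tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (country TEXT, customer_no TEXT)")
conn.execute(
    "CREATE TABLE hub_customer (customer_id INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE)"
)
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)", [("NL", "1"), ("NL", "1"), ("BE", "2")]
)
spec = HubSpec(
    name="hub_customer",
    key_column="customer_id",
    business_key_column="customer_bk",
    source_table="stg_customer",
    source_business_key="s.country || '-' || s.customer_no",  # composite key
)
conn.execute(hub_insert_sql(spec))
conn.execute(hub_insert_sql(spec))  # a second run adds nothing
keys = sorted(row[0] for row in conn.execute("SELECT customer_bk FROM hub_customer"))
print(keys)
```

The same function serves every hub; only the specification row changes, which mirrors the idea of one parameterised load object instead of one generated object per hub.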
  25. Job for hub
  26. Transformation for hub
  27. Metadata needed for a link
     • name
     • key column
     • for each hub (maximum 10, can be a ref-table): hub name; column name for the hub key in the link (roles!); column in the source table → business key of hub
     • link attributes (part of key, no hub, maximum 5)
     • link validity satellite needed?
     • last seen date needed?
     • source table
  28. Job for link
  29. Transformation for link (diagram annotations): Last seen? Lookup hubs. Remove columns not in link. Run table needed for validity sat?
  30. Metadata needed for a hub satellite
     • name
     • key column
     • hub name
     • column in the source table → business key of hub
     • for each attribute (maximum 200): source column, target column
     • source table
  31. Job for hub satellite
  32. Transformation for hub satellite
  33. Metadata needed for a link satellite
     • name
     • key column
     • link name
     • for each hub of the link: column in the source table → business key of hub
     • for each key attribute: source column
     • for each attribute: source column → target column
     • source table
  34. Job for link satellite
  35. Transformation for link satellite
  36. Executing in a loop ..
  37. .. and parallel
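The loop and parallel execution can be pictured with a small Python sketch. It is not how the Kettle jobs are built; `run_stage`, `run_all` and the object names are hypothetical. The sketch assumes that hubs are loaded before links, because the link transformation looks up hubs, and that objects within one stage are independent of each other.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_stage(load_one: Callable[[str], str], names: Sequence[str], max_workers: int = 4) -> List[str]:
    """Load all objects of one stage in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the input order and re-raises the first exception.
        return list(pool.map(load_one, names))


def run_all(load_one: Callable[[str], str], stages: Sequence[Sequence[str]]) -> List[str]:
    """Run the stages one after another; each stage runs its objects in parallel."""
    results: List[str] = []
    for names in stages:
        results.extend(run_stage(load_one, names))
    return results


def fake_load(name: str) -> str:
    # Stand-in for starting the parameterised load for one object.
    return f"loaded {name}"


# Hypothetical objects: hubs first, then links.
stages = [["hub_customer", "hub_product"], ["link_order"]]
results = run_all(fake_load, stages)
print(results)
```

Because each stage finishes before the next one starts, a link load never runs before the hubs it depends on.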
  38. Logging: custom logging, PDI logging. Configuring log tables for concurrent access
  39. Version Control: PDI objects
  40. Version Control: database objects
  41. Some points of interest
     • Easy to make a mistake in the design sheet
     • Generic → a bit harder to maintain and debug
     • Application/tool to maintain metadata?
     • Data Vault generators (e.g. Quipu)?
     • Spinoff using Informatica and Oracle: Sander Robijns
     • Thanks to: Jos van Dongen, Kasper de Graaf
  42. Sourceforge!
