Introduction to Data Vault
Ilja Dmitrijev (www.in-volv.com)
http://www.linkedin.com/in/iljadmitrijev
Wildcard conference, ...
  1. Introduction to Data Vault
     Ilja Dmitrijev (www.in-volv.com)
     http://www.linkedin.com/in/iljadmitrijev
     Wildcard conference, Riga 2013
  2. What do we expect from a Data Warehouse/BI?
     - Time-variant (historized)
     - Non-volatile (no updates)
     - Integrated, enterprise-wide
     - Subject-oriented
     - ETL and query performance
     - Easy to adapt to changes
     - Auditable
  3. What do we have in DWH design now?
     - Star schemas, OLAP and Big Data* are strong in subject-oriented querying of the data.
     - Open questions: Auditability? Historization? Integration? Enterprise-wide? Easy to adapt?
     - Data Vault: presented by Dan Linstedt in 2000
     * Big Data = columnar distributed data stores
  4. Place of Data Vault in DWH architecture
     - The heavy tasks of integration, historization and cleaning are performed in the Data Vault.
     - Data Marts (star schemas, OLAP, Big Data) are lightweight and presentation-only; they can be rebuilt or reloaded in hours.
  5. The main idea of Data Vault
     - Decomposition: break things out into component parts for flexibility and to facilitate the capture of things that are either interpreted in different ways or changing independently of each other.
     - Unification: these parts, however, need to be integrated to define the core business concept (the Entity, the Dimension, etc.), so they must be kept together.
     - Hub: the natural business key
     - Link: the natural business relationships
     - Satellite: all context, descriptive data and history
  6. Hub
     A Hub construct in Data Vault:
     - contains the Business Key, and only the Business Key
     - contains no context
     A Hub table contains only:
     - Business Key
     - Surrogate Key (Data Warehouse)
     - Load Date / Time Stamp
     - Record Source
     A Hub identifies entities that are important to the business. Business key = the value by which business representatives refer to an entity (invoice number, account number, client number, etc.).
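The hub layout listed above can be sketched as a table with exactly those four columns. This is a minimal illustration using Python's built-in sqlite3 module; the table name `hub_customer`, the column names, the sample key `C-1001` and the helper `load_hub` are assumptions made for the example and do not come from the slides.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_sk   INTEGER PRIMARY KEY,   -- surrogate key (data warehouse)
        customer_no   TEXT NOT NULL UNIQUE,  -- business key
        load_dts      TEXT NOT NULL,         -- load date / time stamp
        record_source TEXT NOT NULL          -- record source
    )
""")

def load_hub(conn, customer_no, record_source):
    """Insert the business key if it is new and return its surrogate key."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer (customer_no, load_dts, record_source) "
        "VALUES (?, ?, ?)",
        (customer_no, datetime.now(timezone.utc).isoformat(), record_source),
    )
    row = conn.execute(
        "SELECT customer_sk FROM hub_customer WHERE customer_no = ?",
        (customer_no,),
    ).fetchone()
    return row[0]

# The same business key arriving from two (hypothetical) sources maps to one hub row.
sk_first = load_hub(conn, "C-1001", "crm")
sk_again = load_hub(conn, "C-1001", "billing")
hub_rows = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
```

In this sketch the hub row is never updated: the second load is ignored, so the record source of the first arrival is what stays on the row.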
  7. Link
     A Link construct in Data Vault:
     - contains a Relationship, and only a Relationship
     - contains no context
     - is always 1:1 with the Relationship
     A Link table contains only:
     - Foreign keys for the Relationship (together they form the unique key of the link table)
     - Surrogate Key (Data Warehouse)
     - Load Date / Time Stamp
     - Record Source
     By default all relationships are treated as M:M, which is far more natural than classical RDBMS foreign keys.
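A link table with the columns listed above might look like the following sketch, again using sqlite3. The hub and link names, the surrogate key values and the sample pairs are illustrative assumptions. The point shown is that the combination of foreign keys is the unique key, so many-to-many pairs are stored naturally and a repeated pair is not duplicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_sk INTEGER PRIMARY KEY, customer_no TEXT NOT NULL UNIQUE,
        load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
    CREATE TABLE hub_account (
        account_sk INTEGER PRIMARY KEY, account_no TEXT NOT NULL UNIQUE,
        load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
    CREATE TABLE link_customer_account (
        link_sk       INTEGER PRIMARY KEY,  -- surrogate key (data warehouse)
        customer_sk   INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
        account_sk    INTEGER NOT NULL REFERENCES hub_account (account_sk),
        load_dts      TEXT NOT NULL,        -- load date / time stamp
        record_source TEXT NOT NULL,        -- record source
        UNIQUE (customer_sk, account_sk)    -- the FKs form the unique key
    );
""")

LOAD_DTS = "2013-01-01T00:00:00"
conn.executemany("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
                 [(1, "C-1001", LOAD_DTS, "crm"), (2, "C-1002", LOAD_DTS, "crm")])
conn.executemany("INSERT INTO hub_account VALUES (?, ?, ?, ?)",
                 [(10, "A-10", LOAD_DTS, "billing"), (11, "A-11", LOAD_DTS, "billing")])

# One customer with two accounts, one account shared by two customers,
# and a repeated pair that the unique key silently skips.
pairs = [(1, 10), (1, 11), (2, 10), (1, 10)]
conn.executemany(
    "INSERT OR IGNORE INTO link_customer_account "
    "(customer_sk, account_sk, load_dts, record_source) VALUES (?, ?, ?, ?)",
    [(c, a, LOAD_DTS, "billing") for c, a in pairs],
)
link_rows = conn.execute(
    "SELECT customer_sk, account_sk FROM link_customer_account "
    "ORDER BY customer_sk, account_sk"
).fetchall()
```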
  8. Satellite
     A Satellite construct in Data Vault:
     - contains context only
     - has no FKs (no relationships)
     - is attached to a hub or a link
     - is designed by rate of change, type of data, system…
     A Satellite table contains only:
     - Hub/link surrogate ID
     - Load Date / Time Stamp
     - Context data (attributes)
     - Record Source
     Only one instance of a satellite record is valid at any time.
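One common way to realize the satellite layout above is an insert-only table keyed by the parent surrogate ID plus the load timestamp, where the row with the latest load timestamp is the one currently valid. The sketch below assumes that reading; the table `sat_customer_contact`, its attributes and the helpers `current_row` and `load_satellite` are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_customer_contact (
        customer_sk   INTEGER NOT NULL,  -- hub surrogate id
        load_dts      TEXT NOT NULL,     -- load date / time stamp
        email         TEXT,              -- context data (attributes)
        phone         TEXT,
        record_source TEXT NOT NULL,     -- record source
        PRIMARY KEY (customer_sk, load_dts)
    )
""")

def current_row(conn, customer_sk):
    """Return the currently valid context: the row with the latest load_dts."""
    return conn.execute(
        "SELECT email, phone FROM sat_customer_contact "
        "WHERE customer_sk = ? ORDER BY load_dts DESC LIMIT 1",
        (customer_sk,),
    ).fetchone()

def load_satellite(conn, customer_sk, load_dts, email, phone, record_source):
    """Append a new version only if the context differs from the current one."""
    if current_row(conn, customer_sk) == (email, phone):
        return False
    conn.execute(
        "INSERT INTO sat_customer_contact VALUES (?, ?, ?, ?, ?)",
        (customer_sk, load_dts, email, phone, record_source),
    )
    return True

load_satellite(conn, 1, "2013-01-01T00:00:00", "a@example.com", "111", "crm")
load_satellite(conn, 1, "2013-02-01T00:00:00", "a@example.com", "111", "crm")  # unchanged
load_satellite(conn, 1, "2013-03-01T00:00:00", "b@example.com", "111", "crm")
history = conn.execute("SELECT COUNT(*) FROM sat_customer_contact").fetchone()[0]
latest = current_row(conn, 1)
```

Existing rows are never updated here: history accumulates as new rows, and the unchanged second load adds nothing.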
  9. Decomposition example
     - Handles the “data explosion” issue
     - Vertical partitioning
     - Isolation of structural changes
     - Zero-updates policy
     - Supports real-time data
  10. Data Vault structure example
  11. How does DV contribute to incremental build and agility?
      - You may start modeling even if the full scope is unknown.
      - Simple design rules for hubs, links and satellites reduce the design error rate.
      - As the scope of the DWH expands, the Data Vault can adapt to these changes without impacting the existing model. This is what allows the DWH to be built incrementally and to adapt to change without the need for re-engineering.
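The claim that the model grows without touching what already exists can be illustrated with a small sketch: a new subject area is added purely through new tables, and the definitions of the existing tables are compared before and after. All table names and the helper `table_definitions` are assumptions for the example.

```python
import sqlite3

def table_definitions(conn):
    """Map each table name to the CREATE statement SQLite stored for it."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")

# Initial scope: customers only.
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_sk INTEGER PRIMARY KEY, customer_no TEXT NOT NULL UNIQUE,
        load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
    CREATE TABLE sat_customer_profile (
        customer_sk INTEGER NOT NULL, load_dts TEXT NOT NULL,
        name TEXT, record_source TEXT NOT NULL,
        PRIMARY KEY (customer_sk, load_dts));
""")
before = table_definitions(conn)

# Extended scope: products and purchases arrive as new tables only.
conn.executescript("""
    CREATE TABLE hub_product (
        product_sk INTEGER PRIMARY KEY, product_code TEXT NOT NULL UNIQUE,
        load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
    CREATE TABLE link_customer_product (
        link_sk INTEGER PRIMARY KEY,
        customer_sk INTEGER NOT NULL, product_sk INTEGER NOT NULL,
        load_dts TEXT NOT NULL, record_source TEXT NOT NULL,
        UNIQUE (customer_sk, product_sk));
""")
after = table_definitions(conn)

unchanged = all(after[name] == sql for name, sql in before.items())
added = sorted(set(after) - set(before))
```

No `ALTER TABLE` is needed in this sketch, so loads and queries written against the original tables keep working as the scope expands.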
  12. Structure Extension Examples
  13. A few words about satellite design
      There are no strict rules. Practitioners usually split satellites:
      - By data source: simplifies traceability
      - By context (e.g. identification, contact info, profile): isolates structural changes
      - By rate of change: deals with data explosion
      Or they combine these approaches. Extreme case: one satellite per attribute, which helps to deal with an unpredictably changing source structure.
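Why splitting by rate of change helps with data explosion can be shown with a small count. In the sketch below a satellite gets a new row whenever any of its tracked columns changes; the sample snapshots, the column names and the helper `versions` are made up for illustration.

```python
def versions(snapshots, columns):
    """Rows a satellite tracking `columns` would store: one per detected change."""
    rows = []
    for snap in snapshots:
        projected = tuple(snap[c] for c in columns)
        if not rows or rows[-1] != projected:
            rows.append(projected)
    return rows

# Successive source snapshots of one customer (hypothetical data).
snapshots = [
    {"name": "Acme", "segment": "SMB", "balance": 100},
    {"name": "Acme", "segment": "SMB", "balance": 120},
    {"name": "Acme", "segment": "SMB", "balance": 90},
    {"name": "Acme", "segment": "Corporate", "balance": 90},
    {"name": "Acme", "segment": "Corporate", "balance": 150},
]

# One wide satellite: every change copies all three attributes.
wide = versions(snapshots, ["name", "segment", "balance"])
wide_cells = len(wide) * 3

# Split by rate of change: slow attributes and the fast one are versioned separately.
slow = versions(snapshots, ["name", "segment"])
fast = versions(snapshots, ["balance"])
split_cells = len(slow) * 2 + len(fast) * 1
```

With this data the wide satellite stores 5 rows (15 attribute values), while the split design stores 2 slow rows and 4 fast rows (8 attribute values).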
  14. Cleaning, deduplicating, integrating data in Data Vault
      Data Vault follows the principle that all data should be traceable back to the source and all data transformations are auditable. Data Vault is used not only for capturing and historizing data, but also for transformations: deduplication, derivation, cleaning, etc. In the Data Vault world you will encounter:
      - Raw vault: the original data in Data Vault form, without data creation (missing values are not replaced by default)
      - Rule vault: additional satellites for cleaned, derived, deduplicated data
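The raw vault / rule vault separation can be sketched as two satellites on the same hub: one holding the values exactly as delivered, the other holding the cleaned values together with the name of the rule that produced them. The table names, the rule label `rule:normalize_email_v1` and the function `clean_email` are assumptions for the example, not a prescribed naming scheme.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sat_customer_raw (
        customer_sk INTEGER NOT NULL, load_dts TEXT NOT NULL,
        email TEXT, record_source TEXT NOT NULL,
        PRIMARY KEY (customer_sk, load_dts));
    CREATE TABLE sat_customer_clean (
        customer_sk INTEGER NOT NULL, load_dts TEXT NOT NULL,
        email TEXT, record_source TEXT NOT NULL,
        PRIMARY KEY (customer_sk, load_dts));
""")

# Raw vault: stored as delivered, including the missing value.
conn.executemany("INSERT INTO sat_customer_raw VALUES (?, ?, ?, ?)", [
    (1, "2013-01-01T00:00:00", "  Alice@Example.COM ", "crm"),
    (2, "2013-01-01T00:00:00", None, "crm"),
])

def clean_email(value):
    """Hypothetical cleaning rule: trim and lowercase; keep missing values missing."""
    return None if value is None else value.strip().lower()

RULE = "rule:normalize_email_v1"  # recorded so the transformation is auditable

raw_rows = conn.execute(
    "SELECT customer_sk, load_dts, email FROM sat_customer_raw ORDER BY customer_sk"
).fetchall()
conn.executemany(
    "INSERT INTO sat_customer_clean VALUES (?, ?, ?, ?)",
    [(sk, dts, clean_email(email), RULE) for sk, dts, email in raw_rows],
)
clean_rows = conn.execute(
    "SELECT customer_sk, email, record_source FROM sat_customer_clean "
    "ORDER BY customer_sk"
).fetchall()
raw_after = conn.execute(
    "SELECT email FROM sat_customer_raw ORDER BY customer_sk"
).fetchall()
```

The raw satellite is left untouched by the cleaning step, so every cleaned value can be traced back to the original row and to the rule that derived it.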
  15. Technical implementation considerations
      - Fast, massively parallel load into Data Vault
      - Fast retrieval of data from Data Vault: data changing with high frequency is isolated
      - Indexing and (horizontal) partitioning of data in Data Vault is not so crucial
      - Remember that Data Vault shall not be accessed by end users via ad-hoc analysis tools and reports!
  16. Summary of Data Vault advantages
      - Incremental build, easy to adapt to changes
      - Out-of-the-box data historization and integration framework
      - Simple design rules, business-centric modeling
      - Supports graphs, unstructured and real-time data
  17. Some useful resources
      - Data Vault Discussions
      - DataVaultAcademy
      - www.GeneseeAcademy.com
      - http://danlinstedt.com
      - http://www.anchormodeling.com; 6NF: other methodologies applying a similar modeling approach
      - Quipu, http://www.datawarehousemanagement.org: convenient data modeling, ETL and SQL tools
  18. Credits
      Slides with the formal definitions of Data Vault and its elements were kindly provided by Hans Hultgren (www.GeneseeAcademy.com).
