The document introduces Data Vault modeling as an agile approach to data warehousing. It discusses how Data Vault addresses some limitations of traditional dimensional modeling by allowing for more flexible, adaptable designs. The Data Vault model consists of three simple structures - hubs, links, and satellites. Hubs contain unique business keys, links represent relationships between keys, and satellites hold descriptive attributes. This structure supports incremental development and rapid changes to meet evolving business needs in an agile manner.
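To make the hub/link/satellite split concrete, here is a minimal sketch of what the three structures can look like as tables. It is a hypothetical illustration only: the table and column names (hub_customer, lnk_customer_order, sat_customer_details, and so on) are invented for this example and are not taken from any of the presentations described below.

    # Hypothetical illustration of the three Data Vault structures.
    # All table and column names below are invented for this sketch.

    HUB_CUSTOMER = """
    CREATE TABLE hub_customer (
        hub_customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_bk      VARCHAR(50) NOT NULL,  -- unique business key
        load_dts         TIMESTAMP NOT NULL,    -- when the key first arrived
        record_source    VARCHAR(50) NOT NULL   -- originating system
    );"""

    LINK_CUSTOMER_ORDER = """
    CREATE TABLE lnk_customer_order (
        lnk_customer_order_key INTEGER PRIMARY KEY,
        hub_customer_key       INTEGER NOT NULL,  -- relationship between two
        hub_order_key          INTEGER NOT NULL,  -- business keys (hub rows)
        load_dts               TIMESTAMP NOT NULL,
        record_source          VARCHAR(50) NOT NULL
    );"""

    SAT_CUSTOMER_DETAILS = """
    CREATE TABLE sat_customer_details (
        hub_customer_key INTEGER NOT NULL,        -- parent hub row
        load_dts         TIMESTAMP NOT NULL,      -- version timestamp
        customer_name    VARCHAR(100),            -- descriptive attributes live
        customer_segment VARCHAR(30),             -- only in satellites
        record_source    VARCHAR(50) NOT NULL,
        PRIMARY KEY (hub_customer_key, load_dts)  -- history kept per load
    );"""

    if __name__ == "__main__":
        for ddl in (HUB_CUSTOMER, LINK_CUSTOMER_ORDER, SAT_CUSTOMER_DETAILS):
            print(ddl)

Because new descriptive attributes land only in satellites and new relationships only in links, existing hubs (and the loads that feed them) stay untouched, which is what makes the incremental, agile changes described above possible.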
This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. This is focused on the business intelligence side of the house.
If you want to use these slides, please include the notice: (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
Given at Oracle Open World 2011: Not to be confused with Oracle Database Vault (a commercial db security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It has been in use globally for over 10 years now but is not widely known. The purpose of this presentation is to provide an overview of the features of a Data Vault modeled EDW that distinguish it from the more traditional third normal form (3NF) or dimensional (i.e., star schema) modeling approaches used in most shops today. Topics will include dealing with evolving data requirements in an EDW (i.e., model agility), partitioning of data elements based on rate of change (and how that affects load speed and storage requirements), and where it fits in a typical Oracle EDW architecture. See more content like this by following my blog http://kentgraziano.com or follow me on twitter @kentgraziano.
Agile Data Engineering - Intro to Data Vault Modeling (2016) - Kent Graziano
(Updated deck) As we move more and more towards the need for everyone to do Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring
DAMA, Oregon Chapter, 2012 presentation - an introduction to Data Vault modeling. I cover parts of the methodology and compare and contrast general issues in the EDW space, followed by a brief technical introduction to the Data Vault modeling method.
After the presentation I will be providing a live demonstration of the ETL loading layers!
You can find more on-line training at: http://LearnDataVault.com/training
Data Vault Modeling and Methodology introduction that I provided to a Montreal event in September 2011. It covers an introduction and overview of the Data Vault components for Business Intelligence and Data Warehousing. I am Dan Linstedt, the author and inventor of Data Vault Modeling and methodology.
If you use the images anywhere in your presentations, please credit http://LearnDataVault.com as the source (me).
Thank you kindly,
Daniel Linstedt
I gave this presentation at the Advanced Architecture Conference, Bill Inmon, 2011 in Evergreen, Colorado. This presentation covers a new breed of data warehousing called Operational Data Warehousing. These are the next steps in business intelligence towards self-service BI, enabling users to do more with their enterprise data warehouse solution. Specifically, it talks about how the Data Vault model fits into this picture.
If you would like to use the slides, please e-mail me first, I'd be happy to discuss it with you.
Not to be confused with Oracle Database Vault (a commercial db security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the technical components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics for how to build, and design structures when using the Data Vault modeling technique. The target audience is anyone wishing to explore implementing a Data Vault style data model for an Enterprise Data Warehouse, Operational Data Warehouse, or Dynamic Data Integration Store. See more content like this by following my blog http://kentgraziano.com or follow me on twitter @kentgraziano.
This presentation explains the basics of the ETL (Extract-Transform-Load) concept in relation to data solutions such as data warehousing, data migration, and data integration. CloverETL is presented in detail as an example of an enterprise ETL tool. It also covers typical phases of data integration projects.
This is my presentation at SQLBits 8, Brighton, 9th April 2011. This session is about advanced dimensional modelling topics such as Fact Table Primary Key, Vertical Fact Tables, Aggregate Fact Tables, SCD Type 6, Snapshotting Transaction Fact Tables, 1 or 2 Dimensions, Dealing with Currency Rates, When to Snowflake, Dimensions with Multi Valued Attributes, Transaction-Level Dimensions, Very Large Dimensions, A Dimension With Only 1 Attribute, Rapidly Changing Dimensions, Banding Dimension Rows, Stamping Dimension Rows and Real Time Fact Table. Prerequisites: You need to have a basic knowledge of dimensional modelling and relational database design.
My name is Vincent Rainardi. I am a data warehouse & BI architect. I wrote a book on SQL Server data warehousing & BI, as well as many articles on my blog, www.datawarehouse.org.uk. I welcome questions and discussions on data warehousing at vrainardi@gmail.com. Enjoy the presentation.
Enabling a Data Mesh Architecture with Data Virtualization - Denodo
Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company that is closely associated with the development of distributed agile methodology. A data mesh is a distributed, de-centralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations leverage data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack of domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slowness of centralized data infrastructures in provisioning data and responding to change.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
Data Warehouse:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support.
Extraction:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible.
Data Transformation:
Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse.
Data Loading:
During the load step, it is necessary to ensure that the load is performed correctly and using as few resources as possible. The target of the load process is often a database. To make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency.
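Putting the three steps together, here is a minimal, hypothetical sketch in Python using the built-in sqlite3 module. The source and warehouse tables are invented for illustration, and the load step mirrors the tip above by dropping an index before the bulk insert and rebuilding it afterwards.

    import sqlite3

    # Hypothetical in-memory source system and warehouse target, for illustration only.
    source = sqlite3.connect(":memory:")
    dw = sqlite3.connect(":memory:")

    source.executescript("""
        CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER, country TEXT);
        INSERT INTO orders VALUES (1, 1999, 'us'), (2, 2500, 'ca');
    """)
    dw.executescript("""
        CREATE TABLE dw_orders (order_id INTEGER, amount_usd REAL, country_code TEXT);
        CREATE INDEX idx_dw_orders_country ON dw_orders (country_code);
    """)

    # Extract: retrieve only the required columns from the source system.
    rows = source.execute("SELECT order_id, amount_cents, country FROM orders").fetchall()

    # Transform: convert source formats to the warehouse's standardized format.
    transformed = [(oid, cents / 100.0, country.upper()) for oid, cents, country in rows]

    # Load: drop the index before the bulk insert and rebuild it afterwards,
    # mirroring the disable-then-re-enable tip described above.
    dw.execute("DROP INDEX idx_dw_orders_country")
    dw.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", transformed)
    dw.execute("CREATE INDEX idx_dw_orders_country ON dw_orders (country_code)")
    dw.commit()

    print(dw.execute("SELECT * FROM dw_orders").fetchall())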
(OTW13) Agile Data Warehousing: Introduction to Data Vault ModelingKent Graziano
This is the presentation I gave at OakTable World 2013 in San Francisco. #OTW13 was held at the Children's Creativity Museum next to the Moscone Convention Center and was in parallel with Oracle OpenWorld 2013.
The session discussed our attempts to be more agile in designing enterprise data warehouses and how the Data Vault Data Modeling technique helps in that approach.
Data Warehouse Design and Best Practices - Ivo Andreev
A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced and flexible architecture that is capable of meeting both present and long-term future needs. This session covers a comparison of the main data warehouse architectures together with best practices for the logical and physical design that support staging, load and querying.
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Agile Data Engineering: Introduction to Data Vault 2.0 (2018) - Kent Graziano
(updated slides used for North Texas DAMA meetup Oct 2018) As we move more and more towards the need for everyone to do Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 15 years and is now growing in popularity. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring
Metadata management is critical for organizations looking to understand the context, definition and lineage of key data assets. Data models play a key role in metadata management, as many of the key structural and business definitions are stored within the models themselves. Can data models replace traditional metadata solutions? Or should they integrate with larger metadata management tools & initiatives?
Join this webinar to discuss opportunities and challenges around:
How data modeling fits within a larger metadata management landscape
When can data modeling provide “just enough” metadata management
Key data modeling artifacts for metadata
Organization, Roles & Implementation Considerations
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
Presentation on Data Mesh: the paradigm shift is a new type of ecosystem architecture, a shift left towards a modern distributed architecture that allows domain-specific data, views "data-as-a-product," and enables each domain to handle its own data pipelines.
The process of data warehousing is undergoing rapid transformation, giving rise to various new terminologies, especially due to the shift from the traditional ETL to the new ELT. For someone new to the process, these additional terminologies and abbreviations might seem overwhelming; some may even ask, "Why does it matter if the L comes before the T?"
The answer lies in the infrastructure and the setup. Here is what the fuss is all about: the sequencing of the words and, more importantly, why you should be shifting from ETL to ELT.
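As a rough, hypothetical sketch of what that sequencing means in practice (using Python and SQLite purely for illustration): in ETL the transformation happens in the integration layer before the warehouse ever sees the data, while in ELT the raw data is loaded first and the transformation is pushed down to the warehouse engine as SQL.

    import sqlite3

    raw_rows = [(1, "1999", "us"), (2, "2500", "ca")]  # hypothetical source extract

    # --- ETL: transform in the integration layer, then load the finished result.
    etl_dw = sqlite3.connect(":memory:")
    etl_dw.execute("CREATE TABLE orders (order_id INTEGER, amount_usd REAL, country TEXT)")
    transformed = [(i, int(c) / 100.0, cc.upper()) for i, c, cc in raw_rows]
    etl_dw.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)

    # --- ELT: load the raw data as-is, then transform inside the warehouse with SQL.
    elt_dw = sqlite3.connect(":memory:")
    elt_dw.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents TEXT, country TEXT)")
    elt_dw.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)
    elt_dw.execute("""
        CREATE TABLE orders AS
        SELECT order_id,
               CAST(amount_cents AS REAL) / 100.0 AS amount_usd,
               UPPER(country)                     AS country
        FROM raw_orders
    """)

    print(etl_dw.execute("SELECT * FROM orders").fetchall())
    print(elt_dw.execute("SELECT * FROM orders").fetchall())

The practical difference is where the transformation logic lives and scales: in the integration tool (ETL) or in the warehouse engine itself (ELT).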
Modernizing to a Cloud Data Architecture - Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their experience of a successful migration of their data and workloads to the cloud.
Logical Data Fabric and Data Mesh – Driving Business Outcomes - Denodo
Watch full webinar here: https://buff.ly/3qgGjtA
Presented at TDWI VIRTUAL SUMMIT - Modernizing Data Management
While the technological advances of the past decade have addressed the scale of data processing and data storage, they have failed to address scale in other dimensions: the proliferation of data sources, the diversity of data types and user personas, and the speed of response to change. The essence of the data mesh and data fabric approaches is that they put the customer first and focus on outcomes instead of outputs.
In this session, Saptarshi Sengupta, Senior Director of Product Marketing at Denodo, will address key considerations and provide his insights on why some companies are succeeding with these approaches while others are not.
Watch On-Demand and Learn:
- Why a logical approach is necessary and how it aligns with data fabric and data mesh
- How some of the large enterprises are using logical data fabric and data mesh for their data and analytics needs
- Tips to create a good data management modernization roadmap for your organization
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach - Kent Graziano
First we interview the users, then we design a reporting model based on those interviews. We follow that up with mounds of ETL development to load the new model, basically keeping the user community in the dark during all that development. Does this sound familiar?
This presentation will demonstrate an alternative approach using the Data Vault Data Modeling technique to build a flexible, easily-extensible “Foundation” layer in our data warehouse with an Agile, iterative methodology. Relying on the Business Model and Mapping (BMM) functionality of OBIEE, we can rapidly virtualize a dimensional reporting model using the pattern-based Data Vault Foundation layer to decrease the time, and money, it takes to get BI content in front of end users. Attendees will see a sample Data Vault model designed iteratively and deployed to the semantic model of OBIEE.
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o... - Daniel Zivkovic
Two #ModernDataStack talks and one DevOps talk: https://youtu.be/4R--iLnjCmU
1. "From Data-driven Business to Business-driven Data: Hands-on #DataModelling exercise" by Jacob Frackson of Montreal Analytics
2. "Trends in the #DataEngineering Consulting Landscape" by Nadji Bessa of Infostrux Solutions
3. "Building Secure #Serverless Delivery Pipelines on #GCP" by Ugo Udokporo of Google Cloud Canada
We ran out of time for the 4th presenter, so the event will CONTINUE in March... stay tuned! Compliments of #ServerlessTO.
Watch full webinar here: https://buff.ly/2mHGaLA
What started to evolve as the most agile and real-time enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
• What data virtualization really is
• How it differs from other enterprise data integration technologies
• Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
How a Time Series Database Contributes to a Decentralized Cloud Object Storag... - InfluxData
In this presentation, you'll learn how InfluxDB is a component of Storj’s Tardigrade service and workflows. John Gleeson and Ben Sirb of Storj Labs will cover Storj’s redefinition of a cloud object storage network, how InfluxData fits into Storj’s Open Source Partner Program, and how to collect and manage high-volume, real-time telemetry data from a distributed network.
Original: Lean Data Model Storming for the Agile Enterprise - Daniel Upton
This original publication, aimed at data project leaders, describes a set of methods for agile modeling and delivery of an enterprise data warehouse, which together make it quicker to deliver, faster to load, and more easily adaptable to unexpected changes in source data, business rules or reporting/analytic requirements.
With this set of methods, the parts of data warehouse development that used to be the most resistant to sprint-sized / agile work breakdown -- data modeling and ETL -- are now completely agile, so that this tasking, too, can now be sized purely based on customer requirements, rather than the dictates of a traditional data warehouse architecture.
Data Science Operationalization: The Journey of Enterprise AI - Denodo
Watch full webinar here: https://bit.ly/3kVmYJl
As we move into a world driven by AI initiatives, we find ourselves facing new and diverse challenges when it comes to operationalization. Creating a solution and putting it into practice are certainly not the same thing. The challenges span various organizational and data facets. In many instances, the data scientists may be working in silos, and connecting to the live data may not always be possible. But how does one guarantee that a model developed in a silo is still relevant to live data? How can we manage the data flow and data access across the entire AI operationalization cycle?
Watch on-demand to explore:
- The journey and challenges of the Data Scientist
- How Denodo data virtualization with data movement streamlines operationalization
- The best practices and techniques when dealing with siloed data
- How customers have used data virtualization in their data science initiatives
Government GraphSummit: And Then There Were 15 Standards - Neo4j
Todd Pihl, PhD, Technical Project Mgr. & Mark Jensen, Director of Data Management and Interoperability, National Institutes of Health, Frederick National Laboratory for Cancer Research
Data repositories such as NCI’s Cancer Research Data Commons receive data that use a variety of data models and vocabularies. This presents a significant obstacle to finding and using the data outside of their original purpose. In this talk we’ll show how using Neo4j allows different data models to be represented and mapped to each other, giving data managers a new way to provide harmonized data to their users.
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat... - Denodo
Watch full webinar here: https://bit.ly/3xj6fnm
Presented at Chief Data Officer Live 2021 A/NZ
The world is changing faster than ever. For companies to compete and succeed, they need to be agile in order to respond quickly to market changes and emerging opportunities. Data plays an integral role in achieving this business agility. However, given the complex nature of the enterprise data architecture, finding and analysing data is an increasingly challenging task. Data virtualization is a modern data integration technique that integrates data in real time, without having to physically replicate it.
Watch on-demand this session to understand what data virtualization is and how it:
- Delivers data in real-time, and without replication
- Creates a logical architecture to provide a single view of truth
- Centralises the data governance and security framework
- Democratises data for faster decision making and business agility
Oracle OpenWorld London - session for Stream Analysis, time series analytics, streaming ETL, streaming pipelines, big data, kafka, apache spark, complex event processing
Speeding Time to Insight with a Modern ELT Approach - Databricks
The availability of new tools in the modern data stack is changing the way data teams operate. Specifically, the modern data stack supports an “ELT” approach for managing data, rather than the traditional “ETL” approach. In an ELT approach, data sources are automatically loaded in a normalized state into Delta Lake and opinionated transformations happen in the data destination using dbt. This workflow allows data analysts to move more quickly from raw data to insight, while creating repeatable data pipelines robust to changes in the source datasets. In this presentation, we’ll illustrate how easy it is for even a data analytics team of one to develop an end-to-end data pipeline. We’ll load data from GitHub into Delta Lake, then use pre-built dbt models to feed a daily Redash dashboard on sales performance by manager, and use the same transformed models to power the data science team’s predictions of future sales by segment.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
How you can gain rapid insights and create more flexibility by capturing and storing data from a variety of sources and structures into a NoSQL database.
DAS Slides: Data Architect vs. Data Engineer vs. Data Modeler - DATAVERSITY
The increasing focus on data in today’s organization has increased demand for critical roles such as data architect, data engineer, and data modeler. But there is often confusion and ambiguity around what these roles entail, and what overlap exists between them. This webinar will discuss these data-centric roles and their place in the data-driven organization.
"We can all agree that streaming is super cool. And for a while now, the adoption conversation has been largely led with an all-in mentality. But that’s silly. The only concerns end users have are:
-The freshness of their data
-Latency they require to meet their SLAs from source to consumption
-All while maintaining data quality and governance.
Luckily, the industry has realized this and we have seen a shift of streaming capabilities surfacing as an in-database technology, via objects as trivial to analytics engineers as views - materialized that is. With this convergence of streaming capabilities and batch level accessibility, this is when ELT tools like dbt can join in and expand out the adoption story.
dbt is the T in ELT, Extract Load and Transform. In dbt, analytics engineers design models - SQL (and occasional python) statements that encapsulate business logic. At runtime, dbt will wrap that logic in a DDL statement and send it over to the data platform to execute.
In this session, we’ll discuss how we see streaming at dbt Labs. We will dive into how we are extending dbt to support low-latency scenarios and the recent additions we have made to make batch and streaming allies in a DAG rather than archenemies."
Building an Effective Data Warehouse Architecture - James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
Similar to Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Balance agility and governance with #TrueDataOps and The Data Cloud - Kent Graziano
DataOps is the application of DevOps concepts to data. The DataOps Manifesto outlines WHAT that means, similar to how the Agile Manifesto outlines the goals of the Agile Software movement. But, as the demand for data governance has increased, and the demand to do “more with less” and be more agile has put more pressure on data teams, we all need more guidance on HOW to manage all this. Seeing that need, a small group of industry thought leaders and practitioners got together and created the #TrueDataOps philosophy to describe the best way to deliver DataOps by defining the core pillars that must underpin a successful approach. Combining this approach with an agile and governed platform like Snowflake’s Data Cloud allows organizations to indeed balance these seemingly competing goals while still delivering value at scale.
Given in Montreal on 14-Dec-2021
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most important - how can Snowflake help?
Given in Montreal on 14-Dec-2021
HOW TO SAVE PILES of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc... - Kent Graziano
A good data model, done right the first time, can save you time and money. We have all seen the charts on the increasing cost of finding a mistake/bug/error late in a software development cycle. Would you like to reduce, or even eliminate, your risk of finding one of those errors late in the game? Of course you would! Who wouldn't? Nobody plans to miss a requirement or make a bad design decision (well nobody sane anyway). No data modeler or database designer worth their salt wants to leave a model incomplete or incorrect. So what can you do to minimize the risk?
In this talk I will show you a best practice approach to developing your data models and database designs that I have been using for over 15 years. It is a simple, repeatable process for reviewing your data models. It is one that even a non-modeler could follow. I will share my checklist of what to look for and what to ask the data modeler (or yourself) to make sure you get the best possible data model. As a bonus I will share how I use SQL Developer Data Modeler (a no-cost data modeling tool) to collect the information and report it.
This talk will introduce you to the Data Cloud, how it works, and the problems it solves for companies across the globe and across industries. The Data Cloud is a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the Data Cloud
Delivering Data Democratization in the Cloud with Snowflake - Kent Graziano
This is a brief introduction to Snowflake Cloud Data Platform and our revolutionary architecture. It contains a discussion of some of our unique features along with some real world metrics from our global customer base.
Demystifying Data Warehousing as a Service (GLOC 2019) - Kent Graziano
Extended deck from the 2019 GLOC event in Cleveland. Discusses what a DWaaS is, the top 10 features of Snowflake that represent that, and a check list for what questions to ask when choosing a cloud based data warehouse.
[Given at DAMA WI, Nov 2018] With the increasing prevalence of semi-structured data from IoT devices, web logs, and other sources, data architects and modelers have to learn how to interpret and project data from things like JSON. While the concept of loading data without upfront modeling is appealing to many, ultimately, in order to make sense of the data and use it to drive business value, we have to turn that schema-on-read data into a real schema! That means data modeling! In this session I will walk through both simple and complex JSON documents, decompose them, then turn them into a representative data model using Oracle SQL Developer Data Modeler. I will show you how they might look using both traditional 3NF and data vault styles of modeling. In this session you will:
1. See what a JSON document looks like
2. Understand how to read it
3. Learn how to convert it to a standard data model (a rough flattening sketch follows this list)
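As a rough illustration of that decomposition step (the JSON shape and the resulting table names are invented for this sketch, not taken from the session): a nested document is split into a parent row for the scalar attributes and child rows for each element of a nested array, which then map naturally onto 3NF-style tables or onto a hub/satellite pair in a data vault model.

    import json

    # Hypothetical JSON document of the kind discussed above.
    doc = json.loads("""
    {
      "customer_id": "C-100",
      "name": "Acme Corp",
      "addresses": [
        {"type": "billing",  "city": "Houston"},
        {"type": "shipping", "city": "Denver"}
      ]
    }
    """)

    # Decompose: scalar attributes become a parent row, and each nested array
    # element becomes a child row keyed back to the parent.
    customer_row = {k: v for k, v in doc.items() if not isinstance(v, (list, dict))}
    address_rows = [
        {"customer_id": doc["customer_id"], **addr} for addr in doc["addresses"]
    ]

    print("CUSTOMER:", customer_row)
    for row in address_rows:
        print("CUSTOMER_ADDRESS:", row)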
Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions - Kent Graziano
From a talk I gave at WWDVC and ECO in 2015 about how we built virtual dimensions (views) on a data vault-style data warehouse (see Data Warehousing in the Real World for full details on that architecture)
Demystifying Data Warehouse as a Service (DWaaS) - Kent Graziano
This is from the talk I gave at the 30th Anniversary NoCOUG meeting in San Jose, CA.
We all know that data warehouses and best practices for them are changing dramatically today. As organizations build new data warehouses and modernize established ones, they are turning to Data Warehousing as a Service (DWaaS) in hopes of taking advantage of the performance, concurrency, simplicity, and lower cost of a SaaS solution or simply to reduce their data center footprint (and the maintenance that goes with that).
But what is a DWaaS really? How is it different from traditional on-premises data warehousing?
In this talk I will:
• Demystify DWaaS by defining it and its goals
• Discuss the real-world benefits of DWaaS
• Discuss some of the coolest features in a DWaaS solution as exemplified by the Snowflake Elastic Data Warehouse.
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS - Kent Graziano
(This is the talk I gave at Houston DAMA and Agile Denver BI meetups)
At a past client, in order to meet timelines to fulfill urgent, unmet reporting needs, I found it necessary to build a virtualized Operational Data Store as the first phase of a new Data Vault 2.0 project. This allowed me to deliver new objects, quickly and incrementally to the report developer so we could quickly show the business users their data. In order to limit the need for refactoring in later stages of the data warehouse development, I chose to build this virtualization layer on top of a Type 2 persistent staging layer. All of this was done using Oracle SQL Developer Data Modeler (SDDM) against (gasp!) a MS SQL Server Database. In this talk I will show you the architecture for this approach, the rationale, and then the tricks I used in SDDM to build all the stage tables and views very quickly. In the end you will see actual SQL code for a virtual ODS that can easily be translated to an Oracle database.
Agile Methods and Data Warehousing (2016 update) - Kent Graziano
This presentation takes a look at the Agile Manifesto and the 12 Principles of Agile Development and discusses how these apply to Data Warehousing and Business Intelligence projects. Several examples and details from my past experience are included. Includes more details on using Data Vault as well. (I gave this presentation at OUGF14 in Helsinki, Finland and again in 2016 for TDWI Nashville.)
These are the slides from my talk at Data Day Texas 2016 (#ddtx16).
The world of data warehousing has changed! With the advent of Big Data, Streaming Data, IoT, and The Cloud, what is a modern data management professional to do? It may seem to be a very different world with different concepts, terms, and techniques. Or is it? Lots of people still talk about having a data warehouse or several data marts across their organization. But what does that really mean today in 2016? How about the Corporate Information Factory (CIF), the Data Vault, an Operational Data Store (ODS), or just star schemas? Where do they fit now (or do they)? And now we have the Extended Data Warehouse (XDW) as well. How do all these things help us bring value and data-based decisions to our organizations? Where do Big Data and the Cloud fit? Is there a coherent architecture we can define? This talk will endeavor to cut through the hype and the buzzword bingo to help you figure out what part of this is helpful. I will discuss what I have seen in the real world (working and not working!) and a bit of where I think we are going and need to go in 2016 and beyond.
Worst Practices in Data Warehouse Design - Kent Graziano
This presentation was given at OakTable World 2014 (#OTW14) in San Francisco. After many years of designing data warehouses and consulting on data warehouse architectures, I have seen a lot of bad design choices by supposedly experienced professionals. A sense of professionalism, confidentiality agreements, and some sense of common decency have prevented me from calling people out on some of this. No more! In this session I will walk you through a typical bad design like many I have seen. I will show you what I see when I reverse engineer a supposedly complete design, walk through what is wrong with it, and discuss options to correct it. This will be a test of your knowledge of data warehouse best practices by seeing if you can recognize these worst practices.
Data Vault 2.0: Using MD5 Hashes for Change Data Capture - Kent Graziano
This presentation was given at OakTable World 2014 (#OTW14) in San Francisco as a short Ted-style 10 minute talk. In it I introduce Data Vault 2.0 and its innovative approach to doing change data capture in a data warehouse by using MD5 Hash columns.
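A rough sketch of the hash-based change detection idea follows; the column values, delimiter, and normalization shown are illustrative choices, not the exact Data Vault 2.0 specification. The point is simply that comparing one stored hash against the hash of the incoming row replaces column-by-column comparison.

    import hashlib

    def row_hash(attributes):
        """MD5 over the concatenated descriptive attributes (delimiter is illustrative)."""
        joined = "||".join("" if v is None else str(v).strip().upper() for v in attributes)
        return hashlib.md5(joined.encode("utf-8")).hexdigest()

    # Hash previously stored with the current satellite row for this business key (hypothetical values).
    stored_hash = row_hash(["Acme Corp", "Enterprise", "Houston"])

    # Newly arrived source row for the same business key.
    incoming = ["Acme Corp", "Enterprise", "Denver"]

    if row_hash(incoming) != stored_hash:
        print("Change detected: insert a new satellite row for this key.")
    else:
        print("No change: skip the row.")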
I gave this presentation at OUGF14 in Helsinki, Finland and again for TDWI Nashville. This presentation takes a look at the Agile Manifesto and the 12 Principles of Agile Development and discusses how these apply to Data Warehousing and Business Intelligence projects. Several examples and details from my past experience are included.
Top Five Cool Features in Oracle SQL Developer Data Modeler - Kent Graziano
This is the presentation I gave at OUGF14 in Helsinki, Finland in June 2014.
Oracle SQL Developer Data Modeler (SDDM) has been around for a few years now and is up to version 4.x. It really is an industrial-strength data modeling tool that can be used for any data modeling task you need to tackle. Over the years I have found quite a few features and utilities in the tool that I rely on to make me more efficient (and agile) in developing my models. This presentation will demonstrate at least five of these features, tips, and tricks for you. I will walk through things like modifying the delivered reporting templates, how to create and apply object naming templates, how to use a table template and transformation script to add audit columns to every table, and how to use the new metadata export tool and several other cool things you might not know are there. Since there will likely be patches and new releases before the conference, there is a good chance there will be some new things for me to show you as well. This might be a bit of a whirlwind demo, so get SDDM installed on your device and bring it to the session so you can follow along.
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
1. Agile Data Warehouse Modeling:
Introduction to Data Vault Modeling
Kent Graziano
Data Warrior LLC
Twitter @KentGraziano
2. Agenda
Bio
What do we mean by Agile?
What is a Data Vault?
Where does it fit in a DW/BI architecture
How to design a Data Vault model
Being “agile”
#OUGF14
3. My Bio
Oracle ACE Director
Certified Data Vault Master and DV 2.0 Architect
Member: Boulder BI Brain Trust
Data Architecture and Data Warehouse Specialist
● 30+ years in IT
● 25+ years of Oracle-related work
● 20+ years of data warehousing experience
Co-Author of
● The Business of Data Vault Modeling
● The Data Model Resource Book (1st Edition)
Past-President of ODTUG and Rocky Mountain Oracle
User Group
#OUGF14
4. Manifesto for Agile Software Development
“We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.”
http://agilemanifesto.org/
#OUGF14
5. Applying the Agile Manifesto to DW
User Stories instead of
requirements documents
Time-boxed iterations
● Iteration has a standard length
● Choose one or more user stories to fit in that
iteration
Rework is part of the game
● There are no “missed requirements”... only
those that haven’t been delivered or
discovered yet.
(C) Kent Graziano
#OUGF14
6. Data Vault Definition
The Data Vault is a detail oriented, historical tracking
and uniquely linked set of normalized tables that
support one or more functional areas of business.
It is a hybrid approach encompassing the best of
breed between 3rd normal form (3NF) and star
schema. The design is flexible, scalable, consistent
and adaptable to the needs of the enterprise.
Dan Linstedt: Defining the Data Vault
TDAN.com Article
Architected specifically to meet the needs
of today’s enterprise data warehouses
#OUGF14
7. What is Data Vault Trying to Solve?
What are our other Enterprise
Data Warehouse options?
● Third-Normal Form (3NF): Complex
primary keys (PK’s) with cascading
snapshot dates
● Star Schema (Dimensional): Difficult to
reengineer fact tables for granularity
changes
Difficult to get it right the first
time
Not adaptable to rapid
business change
NOT AGILE!
(C) Kent Graziano
#OUGF14
8. Data Vault Time Line
● Mid 60’s – Dimension & Fact modeling presented by General Mills and Dartmouth University
● E.F. Codd invented relational modeling
● Early 70’s – Bill Inmon began discussing Data Warehousing
● Mid 70’s – AC Nielsen popularized Dimension & Fact terms
● 1976 – Dr Peter Chen created E-R Diagramming
● Chris Date and Hugh Darwen maintained and refined relational modeling
● Mid 80’s – Bill Inmon popularizes Data Warehousing
● Mid – Late 80’s – Dr Kimball popularizes Star Schema
● Late 80’s – Barry Devlin and Dr Kimball release “Business Data Warehouse”
● 1990 – Dan Linstedt begins R&D on Data Vault Modeling
● 2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling
#OUGF14
9. Data Vault Evolution
The work on the Data Vault approach began in the
early 1990s and was completed around 1999.
Throughout 1999, 2000, and 2001, the Data Vault
design was tested, refined, and deployed into specific
customer sites.
In 2002, industry thought leaders were asked to
review the architecture.
● This is when I attended my first DV seminar in Denver and met
Dan!
In 2003, Dan began teaching the modeling techniques
to the general public.
Now, in 2014, Dan has introduced DV 2.0!
(C) Kent Graziano
#OUGF14
12. How to be Agile using DV
Model iteratively
● Use Data Vault data modeling technique
● Create basic components, then add over time
Virtualize the Access Layer
● Don’t waste time building facts and dimensions up front
● ETL and testing take too long
● “Project” objects using pattern-based DV model with
database views (or BI meta layer)
Users see real reports with real data
Can always build out for performance in
another iteration
(C) Kent Graziano
#OUGF14
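As a self-contained sketch of what “projecting” a dimension with views can look like (SQLite via Python purely for illustration; all table, column, and view names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A tiny Data Vault fragment: one hub and one satellite.
CREATE TABLE hub_customer (
    hub_cust_seq_id INTEGER PRIMARY KEY,
    hub_cust_num    TEXT,
    load_dts        TEXT,
    rec_src         TEXT);
CREATE TABLE sat_customer_name (
    hub_cust_seq_id INTEGER REFERENCES hub_customer,
    load_dts        TEXT,
    load_end_dts    TEXT,
    cust_name       TEXT);

-- The 'dimension' is just a view over the current satellite rows, so users
-- see real data without waiting for fact/dimension ETL to be built first.
CREATE VIEW dim_customer_v AS
SELECT h.hub_cust_num AS customer_number,
       s.cust_name    AS customer_name
FROM   hub_customer h
JOIN   sat_customer_name s ON s.hub_cust_seq_id = h.hub_cust_seq_id
WHERE  s.load_end_dts IS NULL;
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'C-1001', '2014-06-01', 'CRM')")
conn.execute("INSERT INTO sat_customer_name VALUES (1, '2014-06-01', NULL, 'Acme Corp')")
print(conn.execute("SELECT * FROM dim_customer_v").fetchall())
```

If the views later become a performance bottleneck, the same definitions can be materialized as physical dimensions in another iteration.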
16. 1. Hub = Business Keys
Hubs = Unique Lists of Business Keys
Business Keys are used to
TRACK and IDENTIFY key information
New: DV 2.0 includes MD5 of the BK to
link to Hadoop/NoSQL
(C) Kent Graziano #OUGF14
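A minimal sketch of that idea (the upper-case/trim normalization here is an illustrative convention, not a mandated one): because the hash depends only on the business key itself, the same value can be computed independently in the relational EDW and on a Hadoop/NoSQL platform, giving both sides a common join key.

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """Deterministic MD5 of a normalized business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

print(hub_hash_key("c-1001"))    # value loaded into the hub
print(hub_hash_key(" C-1001 "))  # same digest, wherever and whenever it is computed
```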
17. 2. Links = Associations
Links =
Transactions and
Associations
They are used to
hook together
multiple sets of
information
In DV 2.0 the BK
attributes migrate
to the Links for
faster querying
(C) Kent Graziano
#OUGF14
18. Modeling Links - 1:1 or 1:M?
Today:
● Relationship is a 1:1 so why model a Link?
Tomorrow:
● The business rule can change to a 1:M.
● You discover new data later.
With a Link in the Data Vault:
● No need to change the EDW structure.
● Existing data is fine.
● New data is added.
(C) Kent Graziano
#OUGF14
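A small sketch of why the Link absorbs that change (SQLite via Python, hypothetical names): the Link is its own table keyed by the two hub keys, so a change from 1:1 to 1:M is just more rows, not a new structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (cust_hkey TEXT PRIMARY KEY);
CREATE TABLE hub_account  (acct_hkey TEXT PRIMARY KEY);
-- The Link carries only the two hub keys (plus load metadata in a full model).
CREATE TABLE lnk_customer_account (
    cust_hkey TEXT REFERENCES hub_customer,
    acct_hkey TEXT REFERENCES hub_account);
""")
conn.execute("INSERT INTO hub_customer VALUES ('C1')")
conn.executemany("INSERT INTO hub_account VALUES (?)", [("A1",), ("A2",)])

# Today the business rule is 1:1 -- one link row.
conn.execute("INSERT INTO lnk_customer_account VALUES ('C1', 'A1')")
# Tomorrow the rule becomes 1:M -- just add rows; no ALTER, no reload of history.
conn.execute("INSERT INTO lnk_customer_account VALUES ('C1', 'A2')")
print(conn.execute("SELECT * FROM lnk_customer_account").fetchall())
```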
19. 3. Satellites = Descriptors
• Satellites provide context for the Hubs and the Links
• Track changes over time
• Like SCD 2
(C) Kent Graziano
#OUGF14
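A rough illustration of that change tracking (hypothetical names, SQLite for brevity): when a descriptive value changes, the current satellite row is end-dated and a new row is inserted, so every prior value stays queryable, much like an SCD Type 2 dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sat_customer_name (
    cust_hkey TEXT, load_dts TEXT, load_end_dts TEXT, cust_name TEXT)""")
conn.execute("INSERT INTO sat_customer_name VALUES ('C1', '2014-01-01', NULL, 'Acme Corp')")

# A changed name arrives: end-date the current version, then insert the new one.
new_dts = "2014-06-01"
conn.execute("UPDATE sat_customer_name SET load_end_dts = ? "
             "WHERE cust_hkey = 'C1' AND load_end_dts IS NULL", (new_dts,))
conn.execute("INSERT INTO sat_customer_name VALUES ('C1', ?, NULL, 'Acme Corporation')",
             (new_dts,))
for row in conn.execute("SELECT * FROM sat_customer_name ORDER BY load_dts"):
    print(row)
```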
21. Data Vault Model Flexibility (Agility)
Goes beyond standard 3NF
• Hyper normalized
● Hubs and Links only hold keys and meta data
● Satellites split by rate of change and/or source
• Enables Agile data modeling
● Easy to add to model without having to change existing
structures and load routines
• Relationships (links) can be dropped and created on-demand.
● No more reloading history because of a missed requirement
Based on natural business keys
• Not system surrogate keys
• Allows for integrating data across functions and source
systems more easily
● All data relationships are key driven.
#OUGF14
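A tiny sketch of what "adding without refactoring" can mean in practice (hypothetical names, SQLite for brevity): a new source system, or a group of fast-changing attributes, simply becomes another satellite on the same hub.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Structures already in production, already being loaded.
conn.executescript("""
CREATE TABLE hub_customer (cust_hkey TEXT PRIMARY KEY, cust_num TEXT);
CREATE TABLE sat_customer_name (cust_hkey TEXT, load_dts TEXT, cust_name TEXT);
""")

# Later, credit attributes arrive from a new source: they get their own satellite.
# No existing table is ALTERed and the existing load routines keep running unchanged.
conn.execute("""CREATE TABLE sat_customer_credit (
    cust_hkey TEXT, load_dts TEXT, credit_limit REAL, rec_src TEXT)""")

print([r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")])
```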
22. Data Vault Extensibility
Adding new components to
the EDW has NEAR ZERO
impact on:
• Existing Loading
Processes
• Existing Data Model
• Existing Reporting & BI
Functions
• Existing Source Systems
• Existing Star Schemas
and Data Marts
(C) LearnDataVault.com #OUGF14
23. Data Vault Productivity
Standardized modeling rules
• Highly repeatable and learnable modeling technique
• Can standardize load routines
● Delta Driven process
● Re-startable, consistent loading patterns.
• Can standardize extract routines
● Rapid build of new or revised Data Marts
• Can be automated
‣ Can use a BI-meta layer to virtualize the reporting
structures
‣ Example: OBIEE Business Model and Mapping tool
‣ Example: BOBJ Universe Business Layer
‣ Can put views on the DV structures as well
‣ Simulate ODS/3NF or Star Schemas
(C) Kent Graziano
#OUGF14
24. Data Vault Adaptability
• The Data Vault holds granular historical
relationships.
• Holds all history for all time, allowing any
source system feed to be reconstructed on-demand
• Easy generation of Audit Trails for data lineage
and compliance.
• Data Mining can discover new relationships
between elements
• Patterns of change emerge from the historical
pictures and linkages.
• The Data Vault can be accessed by power-users
(C) Kent Graziano
#OUGF14
25. Other Benefits of a Data Vault
Modeling it as a DV forces integration of the Business Keys
upfront.
• Good for organizational alignment.
An integrated data set with raw data extends its value beyond BI:
• Source for data quality projects
• Source for master data
• Source for data mining
• Source for Data as a Service (DaaS) in an SOA (Service Oriented Architecture).
Upfront Hub integration simplifies the data integration routines
required to load data marts.
• Helps divide the work a bit.
It is much easier to implement security on these granular pieces.
Granular, re-startable processes enable pin-point failure
correction.
It is designed and optimized for real-time loading in its core
architecture (without any tweaks or mods).
#OUGF14
27. World’s Smallest Data Vault
The Data Vault doesn’t have to be
“BIG”.
A Data Vault can be built
incrementally.
Reverse engineering one component
of the existing models is not
uncommon.
Building one part of the Data Vault,
then changing the marts to feed from
that vault is a best practice.
The smallest Enterprise Data
Warehouse consists of two tables:
● One Hub,
● One Satellite
Hub Customer: Hub_Cust_Seq_ID, Hub_Cust_Num, Hub_Cust_Load_DTS, Hub_Cust_Rec_Src
Satellite Customer Name: Hub_Cust_Seq_ID, Sat_Cust_Load_DTS, Sat_Cust_Load_End_DTS, Sat_Cust_Name, Sat_Cust_Rec_Src
#OUGF14
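Using the column names from the slide, here is a minimal sketch of those two tables (SQLite via Python for illustration; the data types and constraints are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    hub_cust_seq_id   INTEGER PRIMARY KEY,
    hub_cust_num      TEXT NOT NULL UNIQUE,   -- the business key
    hub_cust_load_dts TEXT,
    hub_cust_rec_src  TEXT);

CREATE TABLE sat_customer_name (
    hub_cust_seq_id       INTEGER REFERENCES hub_customer,
    sat_cust_load_dts     TEXT,
    sat_cust_load_end_dts TEXT,
    sat_cust_name         TEXT,
    sat_cust_rec_src      TEXT,
    PRIMARY KEY (hub_cust_seq_id, sat_cust_load_dts));
""")
print("Smallest possible EDW: one hub, one satellite")
```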
28. Notably…
In 2008 Bill Inmon stated that the “Data Vault
is the optimal approach for modeling the EDW
in the DW2.0 framework.” (DW2.0)
The number of Data Vault users in the US
surpassed 500 in 2010 and is growing rapidly
(http://danlinstedt.com/about/dv-customers/)
#OUGF14
29. Organizations using Data Vault
WebMD Health Services
Anthem Blue-Cross Blue Shield
MD Anderson Cancer Center
Denver Public Schools
Independent Purchasing Cooperative (IPC, Miami)
• Owner of Subway
Kaplan
US Defense Department
Colorado Springs Utilities
State Court of Wyoming
Federal Express
US Dept. Of Agriculture
#OUGF14
32. Summary
• Data Vault provides a data
modeling technique that
allows:
‣ Model Agility
‣ Enabling rapid changes and additions
‣ Productivity
‣ Enabling low complexity systems with high
value output at a rapid pace
‣ Easy projections of dimensional models
‣ So? Agile Data Warehousing?
#OUGF14
33. Super Charge Your Data Warehouse
Available on Amazon.com
Soft Cover or Kindle Format
Now also available in PDF at
LearnDataVault.com
Hint: Kent is the Technical
Editor
#OUGF14