Agile Data Engineering: Introduction to Data Vault 2.0 (2018) - Kent Graziano
(updated slides used for the North Texas DAMA meetup, Oct 2018) As more and more of us move toward Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 15 years and is now growing in popularity. The purpose of this presentation is to give attendees an introduction to the components of the Data Vault Data Model, what they are for, and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring
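For readers new to the notation, here is a minimal sketch of the three core Data Vault structures (hubs, links, and satellites). The customer/order tables, column names, and data types are hypothetical and the DDL is generic, not taken from the slides.

```python
# Minimal sketch of the three core Data Vault 2.0 structures, expressed as
# generic DDL strings. Table and column names (customer example) are hypothetical.

HUB_CUSTOMER = """
CREATE TABLE hub_customer (
    hub_customer_hk   CHAR(32)     NOT NULL,  -- hash of the business key
    customer_id       VARCHAR(50)  NOT NULL,  -- business key from the source
    load_dts          TIMESTAMP    NOT NULL,
    record_source     VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hk)
);
"""

LINK_CUSTOMER_ORDER = """
CREATE TABLE lnk_customer_order (
    lnk_customer_order_hk CHAR(32)     NOT NULL,  -- hash of the combined keys
    hub_customer_hk       CHAR(32)     NOT NULL,  -- FK to hub_customer
    hub_order_hk          CHAR(32)     NOT NULL,  -- FK to hub_order
    load_dts              TIMESTAMP    NOT NULL,
    record_source         VARCHAR(100) NOT NULL,
    PRIMARY KEY (lnk_customer_order_hk)
);
"""

SAT_CUSTOMER_DETAILS = """
CREATE TABLE sat_customer_details (
    hub_customer_hk   CHAR(32)     NOT NULL,  -- parent hub key
    load_dts          TIMESTAMP    NOT NULL,  -- one row per detected change (history)
    customer_name     VARCHAR(200),
    customer_email    VARCHAR(200),
    hash_diff         CHAR(32),               -- hash of the descriptive columns
    record_source     VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hk, load_dts)
);
"""

if __name__ == "__main__":
    for ddl in (HUB_CUSTOMER, LINK_CUSTOMER_ORDER, SAT_CUSTOMER_DETAILS):
        print(ddl.strip(), "\n")
```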
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered as well, along with the biggest impact: the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
OLAP on the Cloud with Azure Databricks and Azure Synapse - AtScale
This presentation was part of the 2020 Global Summer Azure Data Fest. It explains how Cloud OLAP helps you analyze large amounts of data on Azure Databricks, Azure Synapse and other data platforms without moving it, and shows how to leverage AtScale’s Cloud OLAP to perform multidimensional analysis – and derive business insights – on data sets from multiple providers, with no data prep or data engineering required.
Data Privacy with Apache Spark: Defensive and Offensive Approaches - Databricks
In this talk, we’ll compare different data privacy techniques for protecting personally identifiable information and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read & write performance.
We’ll cover different offense and defense techniques. You’ll learn what k-anonymity and quasi-identifiers are, and discover the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking, with elementary code examples for cases where no third-party products can be used. We’ll also see what approaches might be adopted to minimize the risks of data exfiltration.
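As a rough illustration of what a few of those techniques look like in practice (not code from the talk), here is a small Python sketch of suppression, perturbation, and tokenization applied to a made-up record; the field names and salt are placeholders.

```python
import hashlib
import random

# Hypothetical record containing a direct identifier (ssn) and quasi-identifiers.
record = {"name": "Jane Doe", "zip": "94105", "age": 37, "ssn": "123-45-6789"}

def suppress(rec, fields):
    """Suppression: drop identifying fields entirely."""
    return {k: v for k, v in rec.items() if k not in fields}

def perturb_age(rec, jitter=3):
    """Perturbation: add bounded random noise to a quasi-identifier."""
    out = dict(rec)
    out["age"] = rec["age"] + random.randint(-jitter, jitter)
    return out

def tokenize(rec, field, salt="s3cret"):
    """Tokenization: replace a direct identifier with a stable surrogate value."""
    out = dict(rec)
    out[field] = hashlib.sha256((salt + rec[field]).encode()).hexdigest()[:16]
    return out

if __name__ == "__main__":
    print(suppress(record, {"ssn"}))
    print(perturb_age(record))
    print(tokenize(record, "ssn"))
```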
Building a Cross Cloud Data Protection Engine - Databricks
Data protection is still at the forefront of many companies’ minds, with potential GDPR fines of up to 4% of global annual turnover (creating a current theoretical maximum fine of $20bn). GDPR affects companies across the world, not just those in Europe, leaving many still playing catch-up. Additional acts and legislation, such as the CCPA, are coming into force, meaning data protection is a constantly evolving landscape, with fines that can decimate some businesses. In this session we will go through how we have worked with our customers to create an Azure and AWS implementation of a Data Protection Engine covering Protection, Detection, Re-Identification and Erasure of PII data.
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS - Kent Graziano
(This is the talk I gave at Houston DAMA and Agile Denver BI meetups)
At a past client, in order to meet timelines to fulfill urgent, unmet reporting needs, I found it necessary to build a virtualized Operational Data Store as the first phase of a new Data Vault 2.0 project. This allowed me to deliver new objects quickly and incrementally to the report developer so we could quickly show the business users their data. In order to limit the need for refactoring in later stages of the data warehouse development, I chose to build this virtualization layer on top of a Type 2 persistent staging layer. All of this was done using Oracle SQL Developer Data Modeler (SDDM) against (gasp!) an MS SQL Server database. In this talk I will show you the architecture for this approach, the rationale, and then the tricks I used in SDDM to build all the stage tables and views very quickly. In the end you will see actual SQL code for a virtual ODS that can easily be translated to an Oracle database.
Delta Lake: Open Source Reliability w/ Apache Spark - George Chow
As presented: Sajith Appukuttan, Solution Architect, Databricks
Sept 12, 2019 at Vancouver Spark Meetup
Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
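As a minimal sketch of that compatibility, assuming the pyspark and delta-spark packages are installed and a local path is acceptable, the following writes and reads a Delta table with ordinary Spark APIs; the path and column names are made up.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Batch write with ACID guarantees: the commit either fully succeeds or not at all.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# The same table can be read back through the familiar Spark read API.
spark.read.format("delta").load("/tmp/events_delta").show()
```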
Data Lineage with Apache Airflow using Marquez - Willy Lulciuc
The term data quality is used to describe the dependability, reliability, and usability of datasets. Data scientists and business analysts often determine the quality of a dataset by its trustworthiness and completeness. But what information might be needed to differentiate between useful vs noisy data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization from a variety of data sources?
With Airflow now ubiquitous for DAG orchestration, organizations increasingly depend on Airflow to manage complex inter-DAG dependencies and provide up-to-date runtime visibility into DAG execution. At WeWork, Airflow has quickly become an important component of our Data Platform, powering billing, space inventory, and more. But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption was delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures?
At WeWork, we feel it’s critical that DAG metadata is collected, maintained, and shared across the organization. This investment in metadata enables:
● Data lineage
● Data governance
● Data discovery
In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain inter-DAG dependencies, catalog historical runs of DAGs, and minimize data quality issues.
Data Vault 2.0: Using MD5 Hashes for Change Data Capture - Kent Graziano
This presentation was given at OakTable World 2014 (#OTW14) in San Francisco as a short TED-style 10-minute talk. In it I introduce Data Vault 2.0 and its innovative approach to doing change data capture in a data warehouse by using MD5 hash columns.
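A minimal sketch of the idea, not code from the talk: compute an MD5 hash over a record's descriptive attributes and compare it with the hash stored on the most recent warehouse row, so only changed records are loaded. The column names and stored hash value below are hypothetical.

```python
import hashlib

def hash_diff(record, columns, delimiter="||"):
    """Concatenate the listed columns in a fixed order and MD5-hash the result."""
    concatenated = delimiter.join(str(record.get(c, "")).strip().upper() for c in columns)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# Descriptive (non-key) attributes that participate in change detection.
descriptive_cols = ["customer_name", "customer_email", "customer_city"]

incoming = {"customer_id": 42, "customer_name": "Jane Doe",
            "customer_email": "jane@example.com", "customer_city": "Houston"}

# Placeholder for the hash stored on the latest row already in the warehouse.
current_hash_in_dw = "0f6f1dd1c2f8a7f3b9f0f2f8e7a6b5c4"

if hash_diff(incoming, descriptive_cols) != current_hash_in_dw:
    print("Change detected: insert a new row")
else:
    print("No change: skip the record")
```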
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas... - Willy Lulciuc
At WeWork, it's critical that we understand the complete context for all datasets. We also want to be able to explore dependencies between jobs and the datasets they produce and consume. To do this, WeWork needs metadata. In this talk I will focus on Marquez, a core service for the collection, aggregation and visualization of a data ecosystem's metadata. Marquez maintains the provenance of how datasets are consumed and produced while providing global visibility into job runtime.
Achieving Lakehouse Models with Spark 3.0 - Databricks
It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star schemas and Kimball modelling are among those things that aren’t going anywhere, but as we move towards the “Data Lakehouse” paradigm – how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise its performance?
Why data warehouses cannot support hot analytics - Imply
Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics
Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics.
Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcomes these limitations, and how companies implement a “temperature-based” approach to analytics.
Scaling Databricks to Run Data and ML Workloads on Millions of VMs - Matei Zaharia
Keynote at Scale By The Bay 2020.
Cloud service developers need to handle massive scale workloads from thousands of customers with no downtime or regressions. In this talk, I’ll present our experience building a very large-scale cloud service at Databricks, which provides a data and ML platform service used by many of the largest enterprises in the world. Databricks manages millions of cloud VMs that process exabytes of data per day for interactive, streaming and batch production applications. This means that our control plane has to handle a wide range of workload patterns and cloud issues such as outages. We will describe how we built our control plane for Databricks using Scala services and open source infrastructure such as Kubernetes, Envoy, and Prometheus, and various design patterns and engineering processes that we learned along the way. In addition, I’ll describe how we have adapted data analytics systems themselves to improve reliability and manageability in the cloud, such as creating an ACID storage system that is as reliable as the underlying cloud object store (Delta Lake) and adding autoscaling and auto-shutdown features for Apache Spark.
Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions - Kent Graziano
From a talk I gave at WWDVC and ECO in 2015 about how we built virtual dimensions (views) on a data vault-style data warehouse (see Data Warehousing in the Real World for full details on that architecture)
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng... - StampedeCon
At the StampedeCon 2015 Big Data Conference: This talk will examine the benefits of using multiple persistence strategies to build an end-to-end predictive engine. Utilizing Spark Streaming backed by a Cassandra persistence layer allows rapid lookups and inserts to be made in order to perform real-time model scoring. Spark backed by Parquet files, stored in HDFS, allows for high-throughput model training and tuning utilizing Spark MLlib. Both of these persistence layers also provide ad-hoc queries via Spark SQL in order to easily analyze model sensitivity and accuracy. Storing the data in this way also provides extensibility: existing tools like CQL can perform operational queries on the data stored in Cassandra, and Impala can perform larger analytical queries on the data stored in HDFS, further maximizing the benefits of the flexible architecture.
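A rough PySpark sketch of the two persistence layers follows, assuming a local Cassandra instance, an existing keyspace and table, and the spark-cassandra-connector package on the classpath; the keyspace, table, and column names are placeholders.

```python
from pyspark.sql import SparkSession

# Spark session with the Cassandra connector pulled in as a package (placeholder version).
spark = (
    SparkSession.builder.appName("multi-persistence-demo")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.0")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

scores = spark.createDataFrame([(1, 0.87), (2, 0.42)], ["user_id", "score"])

# Low-latency serving layer: write model scores to Cassandra for fast lookups/inserts.
(scores.write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="ml", table="scores")
       .mode("append")
       .save())

# High-throughput training layer: persist the same data as Parquet (HDFS or local disk).
scores.write.mode("overwrite").parquet("/tmp/scores_parquet")

# Either layer can then be queried ad hoc through Spark SQL.
spark.read.parquet("/tmp/scores_parquet").createOrReplaceTempView("scores")
spark.sql("SELECT avg(score) AS avg_score FROM scores").show()
```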
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulatory-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack, and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating analytics workflows and tools on converged infrastructure, with shared data, and building “As A Service” architectures oriented towards self-service data exploration and analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020 - Databricks
Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!
Making Data Timelier and More Reliable with Lakehouse Technology - Matei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
[Given at DAMA WI, Nov 2018] With the increasing prevalence of semi-structured data from IoT devices, web logs, and other sources, data architects and modelers have to learn how to interpret and project data from things like JSON. While the concept of loading data without upfront modeling is appealing to many, ultimately, in order to make sense of the data and use it to drive business value, we have to turn that schema-on-read data into a real schema! That means data modeling! In this session I will walk through both simple and complex JSON documents, decompose them, then turn them into a representative data model using Oracle SQL Developer Data Modeler. I will show you how they might look using both traditional 3NF and data vault styles of modeling. In this session you will:
1. See what a JSON document looks like
2. Understand how to read it
3. Learn how to convert it to a standard data model
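As an illustrative sketch (not taken from the session), the snippet below decomposes a small nested JSON document into flat, relational-style rows of the kind you would then model as 3NF tables or data vault structures; the document shape and target names are hypothetical.

```python
import json

# Hypothetical nested JSON document: an order with an embedded customer and line items.
doc = json.loads("""
{
  "order_id": 1001,
  "customer": {"id": 7, "name": "Jane Doe"},
  "lines": [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-9", "qty": 1}
  ]
}
""")

# One row per entity: customer, order header, and order lines (child table).
customer_row = {"customer_id": doc["customer"]["id"],
                "customer_name": doc["customer"]["name"]}

order_row = {"order_id": doc["order_id"],
             "customer_id": doc["customer"]["id"]}

order_line_rows = [
    {"order_id": doc["order_id"], "line_no": i + 1, "sku": line["sku"], "qty": line["qty"]}
    for i, line in enumerate(doc["lines"])
]

print(customer_row)
print(order_row)
print(order_line_rows)
```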
Big Data or Data Warehousing? How to Leverage Both in the Enterprise - Dean Hallman
Before the rise of Big Data, the Enterprise Data Warehouse (EDW) reigned supreme in Business Intelligence architecture. However, modern data rates and volumes often outstripped the capacity of traditional Data Warehousing tools and modeling strategies to keep pace. Many companies turned to unstructured Data Lakes as a means of keeping up with the influx. Consequently, they often discovered that the road from Data Lake to Business Intelligence was filled with its own steep challenges. As a result, any savings in throughput and storage costs were more than offset by the high extraction and analytics costs of turning an unstructured Data Lake into an insights-yielding asset.
Enter Data Vault 2.0, the Enterprise Data Warehouse reimagined to meet today’s data rate, volume and analytics demands. Not strictly an alternative to Data Lakes, Data Vault can easily integrate with your Data Lake and Big Data ingestion pipelines and analytics toolchain. This talk will introduce the fundamental concepts and advantages of Data Vault 2.0, and explain its approach to modeling data around your business domain’s “Hubs”, “Links” and “Satellites”. Finally, the talk will examine a real case study of building a Data Vault, including some challenges and drawbacks we encountered and addressed along the way.
This is a 200-level run-through of the Microsoft Azure Big Data Analytics for the Cloud data platform, based on the Cortana Intelligence Suite offerings.
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
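A minimal boto3 sketch of two of those steps follows, assuming AWS credentials and suitable IAM permissions are configured; the bucket, role, database, crawler, and table names are placeholders, not the ones used in the blog post.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# 1. Register raw S3 data in the Glue Data Catalog by creating and starting a crawler.
glue.create_crawler(
    Name="demo-lake-crawler",
    Role="arn:aws:iam::123456789012:role/demo-glue-role",
    DatabaseName="demo_lake",
    Targets={"S3Targets": [{"Path": "s3://demo-data-lake/raw/"}]},
)
glue.start_crawler(Name="demo-lake-crawler")

# 2. Once the crawler has catalogued the data, query the resulting table with Athena.
athena.start_query_execution(
    QueryString="SELECT count(*) FROM raw_events",
    QueryExecutionContext={"Database": "demo_lake"},
    ResultConfiguration={"OutputLocation": "s3://demo-data-lake/athena-results/"},
)
```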
Seeing Redshift: How Amazon Changed Data Warehousing Forever - Inside Analysis
The Briefing Room with Claudia Imhoff and Birst
Live Webcast April 9, 2013
What a difference a day can make! When Amazon announced its new Redshift offering – a data warehouse in the cloud – the entire industry of information management changed. The most notable disruption? Price. At a whopping $1,000 per year for a terabyte, Redshift achieved a price-point improvement of at least two orders of magnitude, if not three, compared to its top-tier competitors. But pricing is just one change; there's also the entire process by which data warehousing is done.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Claudia Imhoff explain why a new cloud-based reality for data warehousing significantly changes the game for business intelligence and analytics. She'll be briefed by Brad Peters of Birst who will tout his company's BI solution, which has been specifically architected for cloud-based hosting. Peters will discuss several key intricacies of doing BI in the cloud, including the unique provisioning, loading and modeling requirements. Founded in 2004, Birst has nearly a decade of doing cloud-based BI and Analytics.
Visit: http://www.insideanalysis.com
Data Vault 2.0 is a data modeling methodology designed for developing enterprise data warehouses. It was developed by Dan Linstedt in response to the shortcomings of previous data modeling methodologies, such as the Kimball methodology and Inmon methodology, for managing large volumes of data from disparate sources.
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio... - Denodo
Watch full webinar here: https://bit.ly/32TT2Uu
Data virtualization is not just for self-service; it’s also a first-class citizen when it comes to modern data platform architectures. Technology has forced many businesses to rethink their delivery models. Startups such as Amazon and Lyft emerged, leveraging the internet and mobile technology to better meet customer needs, disrupting entire categories of business and growing to dominate their categories.
Schedule a complimentary Data Virtualization Discovery Session with g2o.
Traditional companies are still struggling to meet rising customer expectations. During this webinar with the experts from g2o and Denodo we covered the following:
- How modern data platforms enable businesses to address these new customer expectations
- How you can drive value from your investment in a data platform now
- How you can use data virtualization to enable multi-cloud strategies
Leveraging the strategy insights of g2o and the power of the Denodo platform, companies do not need to undergo the costly removal and replacement of legacy systems to modernize their systems. g2o and Denodo can provide a strategy to create a modern data architecture within a company’s existing infrastructure.
Webinar: The Future of Data Integration - Data Mesh and GoldenGate/Kafka - Jeffrey T. Pollock
The Future of Data Integration: Data Mesh, and a Special Deep Dive into Stream Processing with GoldenGate, Apache Kafka and Apache Spark. This video is a replay of a Live Webinar hosted on 03/19/2020.
Join us for a timely 45-minute webinar to see our take on the future of Data Integration. As the global industry shift towards the “Fourth Industrial Revolution” continues, outmoded styles of centralized batch processing and ETL tooling continue to be replaced by realtime, streaming, microservices and distributed data architecture patterns.
This webinar will start with a brief look at the macro-trends happening around distributed data management and how that affects Data Integration. Next, we’ll discuss the event-driven integrations provided by GoldenGate Big Data, and continue with a deep-dive into some essential patterns we see when replicating Database change events into Apache Kafka. In this deep-dive we will explain how to effectively deal with issues like Transaction Consistency, Table/Topic Mappings, managing the DB Change Stream, and various Deployment Topologies to consider. Finally, we’ll wrap up with a brief look into how Stream Processing will help to empower modern Data Integration by supplying realtime data transformations, time-series analytics, and embedded Machine Learning from within data pipelines.
GoldenGate: https://www.oracle.com/middleware/tec...
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
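As a rough sketch of the consuming side of such a replication pipeline, the snippet below reads change events from a Kafka topic with the kafka-python client; the topic name, and the assumption that records arrive as JSON documents with op_type and table fields, are placeholders rather than GoldenGate specifics.

```python
import json
from kafka import KafkaConsumer

# Consume change records from a hypothetical topic carrying database change events.
consumer = KafkaConsumer(
    "orders.changes",                      # placeholder topic name
    bootstrap_servers="localhost:9092",
    group_id="demo-cdc-consumer",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Route or apply the change downstream; here we just print the operation and table.
    print(change.get("op_type"), change.get("table"), change)
```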
Oracle OpenWorld London - session on stream analysis, time-series analytics, streaming ETL, streaming pipelines, big data, Kafka, Apache Spark, and complex event processing
This is Part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Are you facing the challenge to meet growing IT requirements while operating on a limited budget?
Learn more about why you should transform your database management system (DBMS) and make open source part of your strategic business and IT choices. An open source DBMS offers you various benefits, including cost reduction, liberation from vendor lock-in, and a large development community. Paired with enterprise-class services, 24x7 support and reliable management tools, open source is a first class alternative to traditional proprietary DBMSs.
Big data ingest frameworks ship with an array of connectors for common data origins and destinations, such as flat files, S3, HDFS, Kafka, etc., but sometimes you need to send data to, or receive data from, a system that's not on the list. StreamSets includes template code for building your own connectors and processors; we'll walk through the process of building a simple destination that sends data to a REST web service, and show how it can be extended to target more sophisticated systems such as Salesforce Wave Analytics.
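StreamSets connectors are actually written in Java against the SDK template code mentioned above, so the Python sketch below only illustrates the core logic such a destination performs, batching records and POSTing them to a REST endpoint; the URL and batch size are placeholders.

```python
import json
import requests

ENDPOINT = "https://example.com/api/records"   # placeholder REST endpoint
BATCH_SIZE = 100                               # placeholder batch size

def write_batch(records):
    """Send one batch of records to the web service and fail loudly on HTTP errors."""
    response = requests.post(
        ENDPOINT,
        data=json.dumps(records),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
    response.raise_for_status()

def write(records):
    """Split incoming records into batches, mirroring what a destination's write step does."""
    for start in range(0, len(records), BATCH_SIZE):
        write_batch(records[start:start + BATCH_SIZE])

if __name__ == "__main__":
    write([{"id": i, "value": f"record-{i}"} for i in range(250)])
```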
Enabling Next Gen Analytics with Azure Data Lake and StreamSets - StreamSets Inc.
Big data and the cloud are perfect partners for companies who want to unlock maximum value from all of their unstructured, semi-structured, and structured data. The challenge has been how to create and manage a reliable end-to-end solution that spans data ingestion, storage and analysis in the face of the volume, velocity and variety of big data sources.
In this webinar, we will show you how to achieve big data bliss by combining StreamSets Data Collector, which specializes in creating and running complex any-to-any dataflows, with Microsoft's Azure Data Lake and Azure analytic solutions.
We will walk through an example of how a major bank is using StreamSets to transport their on-premise data to the Azure Cloud Computing Platform and Azure Data Lake to take advantage of analytics tools with unprecedented scale and performance.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
A Key to Real-time Insights in a Post-COVID World (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/2EpHGyd
Presented at Data Champions, Online Asia 2020
Businesses and individuals around the world are experiencing the impact of a global pandemic. With many workers and potential shoppers still sequestered, COVID-19 is proving to have a momentous impact on the global economy. Regardless of the current situation and post-pandemic era, real-time data becomes even more critical to healthcare practitioners, business owners, government officials, and the public at large where holistic and timely information are important to make quick decisions. It enables doctors to make quick decisions about where to focus the care, business owners to alter production schedules to meet the demand, government agencies to contain the epidemic, and the public to be informed about prevention.
In this on-demand session, you will learn about the capabilities of data virtualization as a modern data integration technique and how organisations can:
- Rapidly unify information from disparate data sources to make accurate decisions and analyse data in real-time
- Build a single engine for security that provides audit and control by geographies
- Accelerate delivery of insights from your advanced analytics project
The Great Lakes: How to Approach a Big Data Implementation - Inside Analysis
The Briefing Room with Dr. Robin Bloor and Think Big, a Teradata Company
Live Webcast April 7, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=4114b87441ab7b2b4c52f6b24776e5a1
The more things change in Big Data, the more they stay the same. Indeed, there are many similarities between a Hadoop-based Data Lake and today’s modern Data Warehouse. Regardless of platform, information workers must still be able to turn their assets into action quickly, without taking a hit on governance or downstream performance.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the challenges facing organizations who endeavor on Big Data projects. He’ll be briefed by Rick Stellwagen of Think Big, a Teradata Company, who will outline his company’s approach to handling Big Data implementations. Rick will discuss the role of the data lake, and how timely response of queries is critical for reporting and analysis.
Visit InsideAnalysis.com for more information.
Horses for Courses: Database Roundtable - Eric Kavanagh
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
John Hammink's talk at Great Wide Open 2016. We discuss: 1) the need for data analytics infrastructure that can scale exponentially, 2) what such an infrastructure must contain, and finally 3) the need for an infrastructure to be able to handle un- and semi-structured data.
Data APIs as a Foundation for Systems of Engagement - Victor Olex
APIs have finally crossed over to the world of enterprise software, data analytics and application integration. Spearheaded by Amazon, propagated by internet startups and now adopted by the largest of businesses, including Wall Street top firm Goldman Sachs, APIs are here to stay. In this presentation we link all the facts and examine the opportunities stemming from Resource Oriented Architecture, a holistic approach to API implementation in large organizations.
Similar to Data Vault 2.0: Big Data Meets Data Warehousing
Building Reliability - The Realities of Observability - All Things Open
Presented at the ATO RTP Meetup
Presented by Jeremy Proffit, Director of DevSecOps & SRE for Customer Care and Communications, Ally
Title: Building Reliability - The Realities of Observability
Abstract: Join me as we discuss true observability and learn what works and what doesn't. We'll not only discuss dashboards, monitoring and alerting, but also how these can be built by automation or included in your IaC modules. We'll talk about how to properly alert staff based on priority to keep your staff and yourself sane. And we'll even discuss architecture, how it impacts reliability, and why serverless isn't always the best at being reliable.
Presented at the ATO RTP Meetup
Presented by Peter Zaitsev, Founder of Percona
Title: Modern Database Best Practices
Abstract: There are now more database choices available to developers than ever before - general-purpose databases and specialized databases, single-node and distributed databases, open source and proprietary databases, and databases available exclusively in the cloud. In this presentation we will cover best practices for choosing database(s) for your applications, best practices for application development, and how to manage those databases to achieve the best possible performance, security and availability at the lowest cost.
All Things Open 2023
Presented at All Things Open 2023
Presented by Deb Bryant - Open Source Initiative, Patrick Masson - Apereo Foundation, Stephen Jacobs - Rochester Institute of Technology, Ruth Suehle - SAS, & Greg Wallace - FreeBSD Foundation
Title: Open Source and Public Policy
Abstract: New regulations in the software industry and adjacent areas such as AI, open science, open data, and open education are on the rise around the world. Cyber Security, societal impact of AI, data and privacy are paramount issues for legislators globally. At the same time, the COVID-19 pandemic drove collaborative development to unprecedented levels and took Open Source software, open research, open content and data from mainstream to main stage, creating tension between public benefit and citizen safety and security as legislators struggle to find a balance between open collaboration and protecting citizens.
Historically, the open source software community and foundations supporting its work have not engaged in policy discussions. Moving forward, thoughtful development of these important public policies whilst not harming our complex ecosystems requires an understanding of how our ecosystem operates. Ensuring stakeholders without historic benefit of representation in those discussions becomes paramount to that end.
Please join our open discussion with open policy stakeholders working constructively on current open policy topics. Our panelists will provide a view into how OSS foundations and other open domain allies are now rising to this new challenge as well as seizing the opportunity to influence positive changes to the public’s benefit.
Topics: Public Policy, Open Science, Open Education, current legislation in the US and EU, US interest in OSS sustainability, intro to the Open Policy Alliance
Find more info about All Things Open:
On the web: https://www.allthingsopen.org/
Twitter: https://twitter.com/AllThingsOpen
LinkedIn: https://www.linkedin.com/company/all-things-open/
Instagram: https://www.instagram.com/allthingsopen/
Facebook: https://www.facebook.com/AllThingsOpen
Mastodon: https://mastodon.social/@allthingsopen
Threads: https://www.threads.net/@allthingsopen
2023 conference: https://2023.allthingsopen.org/
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...All Things Open
Presented at All Things Open 2023
Presented by Ashpak Shaikh & Lucy Shen - Intuit
Title: Weaving Microservices into a Unified GraphQL Schema with graph-quilt
Abstract: The magic of GraphQL is that it provides data access through a single endpoint—clean and easy. But as the number of GraphQL microservices your tech stack depends on starts to grow, that single-endpoint promise becomes a new multi-endpoint problem. Ideally, we would have an orchestrator that could aggregate schemas from multiple microservices into a unified GraphQL schema and route the requests to the appropriate microservice.
Enter graph-quilt, an open source Java library that provides recursive schema stitching and Apollo Federation style schema composition. In this talk, we’ll walk through our GraphQL journey and show you how to use graph-quilt to simplify your data orchestration needs. We will also share our open sourced reference implementation of a highly performant graph-quilt gateway currently being used in production here at Intuit, where we’ve had incredible success in scaling the gateway with 50+ microservices and 150+ clients.
The State of Passwordless Auth on the Web - Phil NashAll Things Open
Presented at All Things Open 2023
Presented by Phil Nash - Sonar
Title: The State of Passwordless Auth on the Web
Abstract: Can we get rid of passwords yet? They make for a poor user experience and users are notoriously bad with them. The advent of WebAuthn has brought a passwordless world closer, but where do we really stand?
In this talk we'll explore the current user experience of WebAuthn and the requirements a user has to fulfil to authenticate without a password. We'll also explore the fallbacks and safeguards we can use to make the password experience better and more secure. By the end of the session you'll have a vision of how authentication could look in the future and a blueprint for how to build the best auth experience today.
Total ReDoS: The dangers of regex in JavaScriptAll Things Open
Presented at All Things Open 2023
Presented by Phil Nash - Sonar
Title: Total ReDoS: The dangers of regex in JavaScript
Abstract: Regular expressions are complicated and can be hard to learn. On top of that, they can also be a security risk; writing the wrong pattern can open your application up to denial of service attacks. One token out of place and you invite in the dreaded ReDoS.
But how can a regular expression cause this? In this talk we’ll track down the patterns that can cause this trouble, explain why they are an issue and propose ways to fix them now and avoid them in the future. Together we’ll demystify these powerful search patterns and keep your application safe from expressions that behave in a way that is anything but regular.
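The talk's examples are in JavaScript, but the backtracking behaviour it warns about exists in any backtracking regex engine. As a rough illustration (not from the talk), the Python sketch below times a classic nested-quantifier pattern against inputs that can never match and compares it with a single-quantifier rewrite; the vulnerable pattern's runtime roughly doubles with each extra character.

```python
import re
import time

# A classic ReDoS-prone pattern: nested quantifiers force the backtracking
# engine to retry exponentially many ways of splitting the run of 'a's.
EVIL = re.compile(r"^(a+)+$")

# A linear-time rewrite of the same match: one quantifier, no nesting.
SAFE = re.compile(r"^a+$")

for n in (16, 18, 20, 22):
    attack = "a" * n + "!"          # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    EVIL.match(attack)
    slow = time.perf_counter() - start
    start = time.perf_counter()
    SAFE.match(attack)
    fast = time.perf_counter() - start
    print(f"n={n}: nested quantifier {slow:.4f}s, rewritten {fast:.6f}s")
```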
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
Presented at All Things Open 2023
Presented by Karl Mozurkewich - Storj
Title: What Does Real World Mass Adoption of Decentralized Tech Look Like?
Abstract: We delve into the transformative potential of decentralized technology. Beginning with a brief overview of the rise of centralization with the advent of the internet and the counter-shift marked by blockchain we explore the intrinsic characteristics of decentralized and distributed systems, such as trustless operations, peer-to-peer networks, and enterprise application scalability. Various sectors, including finance, supply chains, media and entertainment, data science and cloud infrastructure are on the brink of disruption. The societal implications are vast, with the potential for greater individual empowerment, a greener planet and more viable resource utilization, but concerns about data security persist.
Presented at All Things Open 2023
Presented by Anastasia Lalamentik - Kaleido
Title: How to Write & Deploy a Smart Contract
Abstract: In this talk, Anastasia Lalamentik, Full Stack Engineer at Kaleido, will walk through how Ethereum smart contracts work and go over related concepts like gas fees, the Ethereum Virtual Machine (EVM), the block explorer, and the Solidity programming language. This is vital to anyone who wants to build a blockchain app and is a great introduction to blockchain technology for newcomers to the space.
By the end of the talk, attendees will better understand how to:
- Write a simple smart contract
- Deploy their smart contract to an Ethereum test network through the latest tools like Hardhat and the MetaMask wallet
- Test interactions with their deployed smart contract and ensure that everything is working properly
Additionally, participants will get to interact with Anastasia's deployed smart contract at the end of the talk. Anastasia's past talks have attracted a diverse group of participants with a range of experience in the space.
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlowAll Things Open
Presented at All Things Open 2023
Presented by Paul Brebner - Instaclustr (by Spot by NetApp)
Title: Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Abstract: In this talk we’ll build a Drone delivery application, and then use it to do some Machine Learning “on the fly”.
In the 1st part of the talk, we'll build a real-time Drone Delivery demonstration application using a combination of two open-source technologies: Uber’s Cadence (for stateful, scheduled, long-running workflows), and Apache Kafka (for fast streaming data).
With up to 2,000 (simulated) drones and deliveries in progress at once this application generates a vast flow of spatio-temporal data.
In the 2nd part of the talk, we'll use this platform to explore Machine Learning (ML) over streaming and drifting Kafka data with TensorFlow to try and predict which shops will be busy in advance.
Presented at the All Things Open 2023 Inclusion and Diversity in Open Source Event
Presented by Efraim Marquez-Arreaza - Red Hat
Title: DEI Challenges and Success
Abstract: In today's world, many companies and organizations have Diversity, Equity and Inclusion (DEI) communities. Red Hat Unidos is a DEI community focused on advocating for the Hispanic/Latine community. In this talk, we would like to share our challenges and successes over the past 4 years, and our plans for the future.
Presented at All Things Open 2023
Presented by Lydia Cupery - HubSpot
Title: Scaling Web Applications with Background Jobs: Takeaways from Generating a Huge PDF
Abstract: Do you need to perform time-consuming or CPU-intensive processes in your web application but are concerned about performance? That’s where background jobs come in. By offloading resource-intensive tasks to separate worker processes, you can improve the scalability of your web application.
In this talk, I'll share my experience of using background jobs to scale our web application. I'll discuss the challenges my team faced that led us to adopt background jobs. Then, I'll share practical tips on how to design background jobs for CPU-intensive or time-consuming processes, such as generating huge PDFs and batch emailing. I'll wrap up by going over the performance and cost tradeoffs of background jobs.
I'll use Typescript, Express, and Heroku as examples in this talk, but the concepts and best practices that I'll share are applicable to other languages and tools.
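The session's examples use TypeScript, Express, and Heroku; as the abstract notes, the pattern itself is language-agnostic. The standard-library Python sketch below (names like generate_report and handle_request are illustrative, not from the talk) shows only the core shape: the request handler enqueues work and returns immediately, a worker process does the slow task, and the caller polls for the result.

```python
# Minimal background-job sketch using only the standard library.
import multiprocessing as mp
import time
import uuid

def generate_report(job_id: str) -> str:
    time.sleep(2)                      # stand-in for CPU-heavy PDF generation
    return f"report-{job_id}.pdf"

def worker(jobs: mp.Queue, results) -> None:
    while True:
        job_id = jobs.get()
        if job_id is None:             # sentinel: shut the worker down
            break
        results[job_id] = generate_report(job_id)

def handle_request(jobs: mp.Queue) -> str:
    """What the web endpoint would do: enqueue and respond right away."""
    job_id = uuid.uuid4().hex
    jobs.put(job_id)
    return job_id                      # the client polls for the result later

if __name__ == "__main__":
    manager = mp.Manager()
    jobs, results = mp.Queue(), manager.dict()
    proc = mp.Process(target=worker, args=(jobs, results), daemon=True)
    proc.start()

    job_id = handle_request(jobs)      # returns immediately
    print("accepted job", job_id)
    while job_id not in results:       # client-side polling loop
        time.sleep(0.2)
    print("finished:", results[job_id])
    jobs.put(None)
    proc.join()
```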
Presented at All Things Open 2023
Presented by Robert Aboukhalil - CZI
Title: Supercharging tutorials with WebAssembly
Abstract: sandbox.bio is a free platform that features interactive command-line tutorials for bioinformatics. This talk is a deep-dive into how sandbox.bio was built, with a focus on how WebAssembly enabled bringing command-line tools like awk and grep to the web. Although these tools were originally written in C/C++, they all run directly in the browser, thanks to WebAssembly! And since the computations run on each user's computer, this makes the application highly scalable and cost-effective.
Along the way, I'll discuss how WebAssembly works and how to get started using it in your own applications. The talk will also cover more advanced WebAssembly features such as threads and SIMD, and will end with a discussion of WebAssembly's benefits and pitfalls (it's a powerful technology, but it's not always the right tool!).
Presented at All Things Open 2023
Presented by K.S. Bhaskar - YottaDB LLC
Title: Using SQL to Find Needles in Haystacks
Abstract: Database journal files capture every update to a database. A database of a few hundred GB can generate GBs worth of journal files every minute at busy times. Troubleshooting and forensics, especially of rare and intermittent problems, such as which process made what update and when, is an exercise in finding needles in haystacks. A similar problem exists with syslogs. A solution is to load the journal files and syslogs into a database, and use SQL to query that database. Bhaskar will present and demonstrate this with a 100% FOSS stack.
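Bhaskar's demo uses his own 100% FOSS stack; purely as an illustration of the idea, the sketch below loads a few syslog-style lines into an in-memory SQLite table and uses SQL to pull out the one event of interest.

```python
# Load log lines into a database once, then use SQL to find the needle
# instead of grepping gigabytes of text. The sample rows are made up.
import sqlite3

SYSLOG_SAMPLE = [
    ("2024-05-28T10:15:02", "db01", "app[4412]", "update order=991 status=paid"),
    ("2024-05-28T10:15:03", "db01", "app[4412]", "update order=992 status=failed"),
    ("2024-05-28T10:16:47", "db02", "cron[201]", "nightly vacuum started"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syslog (ts TEXT, host TEXT, process TEXT, message TEXT)")
conn.executemany("INSERT INTO syslog VALUES (?, ?, ?, ?)", SYSLOG_SAMPLE)

# Which process touched order 992, and when?
for row in conn.execute(
    "SELECT ts, host, process FROM syslog WHERE message LIKE '%order=992%'"
):
    print(row)
```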
Configuration Security as a Game of Pursuit InterceptAll Things Open
Presented at All Things Open 2023
Presented by Wes Widner - Automox
Title: Configuration Security as a Game of Pursuit Intercept
Abstract: In this session we will take a look at the emerging field of cloud security posture management and how we can approach the problem space using a class of board games known as pursuit/intercept. Using the game Scotland Yard as a visual illustration, we'll explore the cognitive and technical limitations that all CSPM systems face and what you should look for when evaluating the strengths and weaknesses of CSPM vendors and approaches.
Presented at All Things Open 2023
Presented by Carol Huang & Mike Fix - Stripe
Title: Scaling an Open Source Sponsorship Program
Abstract: We already know this: the open-source ecosystem needs further monetary investment from the companies that benefit most from it. Likewise, companies say they want to participate in these initiatives, but find it hard to dedicate resources to open source funding when there isn’t a clear ROI.
This talk discusses how the Open Source Program Office at Stripe built a scalable, sustainable open source sponsorship model that aligns internal company incentives with those of open source maintainers and the community at large. We go over the unique “platformization” of our OSPO that allowed us to create multiple funding models, such as BYOB (Bring Your Own Budget), and share lessons learned from this experience as well as other OSPOs.
Build Developer Experience Teams for Open SourceAll Things Open
Presented at All Things Open 2023
Presented by Arundeep Nagaraj - Amazon Web Services (AWS)
Title: Build Developer Experience Teams for Open Source
Abstract: Open Source has become the default strategy for many IT organizations and enterprises. However, the constant challenge for Open Source leaders of these organizations has been:
How is my product's developer experience?
Is this the right metric to track?
How can I scale my team to support our products better?
How can I add automation to scale redundant workflows?
If my product involves working with developers, how can I scale to the complexity of the requests and reduce Engineering bandwidth?
The challenge of supporting open source products continues to grow depending on the end-user persona, whether they are consumers of or contributors to your product. Consumers use your product, SDKs, and APIs and may get blocked or run into issues, whereas contributors are advanced users of your software who understand the codebase well enough to provide meaningful contributions back to the product.
The answer is to treat Open Source support as a first-class citizen of your corporate support strategy. Employing the right level of developer-focused support, as opposed to traditional infrastructure-based support, is key to scaling to the number of developers using your product. Supporting customers in the open involves more than pure support: it means building customer and developer experiences (DX) in the open, across platforms and communities, so that your product's users and developers can focus on the end-to-end value add. This helps with active developer growth and user retention.
Key Takeaways:
- IT leaders of Open Source will learn to employ strategies to build a DX team that engages on multiple platforms
- Work on identifying accurate metrics for product and organization
- Innovate on platforms such as Discord to build a bot and a dashboard
- Ability to leverage customer feedback and iterate over the customer success flywheel
- Distinguish between DX and Developer Advocacy (DA)
Presented at All Things Open 2023
Presented by Danny McCormick - Google
Title: Deploying Models at Scale with Apache Beam
Abstract: Apache Beam is an open source tool for building distributed scalable data pipelines. This talk will explore how Beam can be used to perform common machine learning tasks, with a heavy focus on running inference at scale. The talk will include a demo component showing how Beam can be used to deploy and update models efficiently on both CPUs and GPUs for inference workloads.
An attendee can expect to leave this talk with a high level understanding of Beam, the challenges of deploying models at scale, and the ability to use Beam to easily parallelize their inference workloads.
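For readers who want a feel for the API before the talk, here is a minimal sketch of Beam's RunInference transform with a scikit-learn model handler (not taken from the talk; the model path and feature values are placeholders, and exact handler options can vary by Beam release).

```python
# Minimal Beam RunInference sketch with a scikit-learn model handler.
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

model_handler = SklearnModelHandlerNumpy(
    model_uri="gs://my-bucket/models/model.pkl"  # hypothetical pickled model
)

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateExamples" >> beam.Create([np.array([1.0, 2.0]),
                                           np.array([3.0, 4.0])])
        | "RunInference" >> RunInference(model_handler)  # batches inputs, calls predict()
        | "Print" >> beam.Map(print)                     # one PredictionResult per example
    )
```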
Sudo – Giving access while staying in controlAll Things Open
Presented at All Things Open 2023
Presented by Peter Czanik - One Identity
Title: Sudo – Giving access while staying in control
Abstract: Sudo is used by millions to control and log administrator access to systems, but with only the default configuration there are plenty of blind spots. The latest features in sudo let you watch some of those previously blind spots and control access to them. Here are four major new features that have arrived since the 1.9.0 release, allowing you to see your blind spots:
- configuring a working directory or chroot within sudo often makes full shell access redundant
- JSON-formatted logs give you more details on events and are easier to act on
- relays in sudo_logsrvd make session recording collection more secure and reliable
- you can log and control sub-commands executed by the command run through sudo
Let us take a closer look at each of these.
Previously, there were quite a few situations where you had to give users full shell access through sudo. Typical examples include needing to run a command from a given directory, or running commands in a chroot environment. You can now configure the working directory or the chroot directory and give access only to the command the user really needs.
Logging is central to sudo: it shows who did what on the system. Using JSON-formatted log messages gives you even more information about events. Even better, structured logs are easier to act on. Setting up alerting for suspicious events is much easier when you have a single parser to configure for any kind of sudo logs. You can collect sudo logs not only by local syslog, but also by using sudo_logsrvd, the same application used to collect session recordings.
Speaking of session recordings: instead of using a single central server, you can now have multiple levels of sudo_logsrvd relays between the client and the final destination. This allows session collection even if the central server is unavailable, providing you with additional security. It also makes your network configuration simpler.
Finally, you can log sub-commands executed from the command started through sudo. You can see commands started from a shell. No more unnoticed shell access from text editors. Best of all: you can also intercept sub-commands.
These are just a few of the most prominent features helping you to watch and control previous blind spots on your systems. See these and other possibilities in action in some live demos during our presentation.
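As a small illustration of why the JSON-formatted logs are easier to act on, the sketch below filters structured sudo events with a single parser instead of per-format regexes. The field names used here (submituser, runuser, command) are assumptions for the example; check the JSON your sudo/sudo_logsrvd version actually emits.

```python
# Parse JSON log lines once and filter on fields; field names are illustrative.
import json

ALERT_COMMANDS = {"/bin/sh", "/bin/bash", "/usr/bin/vi"}

def suspicious(event: dict) -> bool:
    return event.get("command") in ALERT_COMMANDS

log_lines = [
    '{"submituser": "alice", "command": "/usr/bin/systemctl", "runuser": "root"}',
    '{"submituser": "bob", "command": "/bin/bash", "runuser": "root"}',
]

for line in log_lines:
    event = json.loads(line)
    if suspicious(event):
        print(f"ALERT: {event['submituser']} obtained {event['command']} "
              f"as {event['runuser']}")
```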
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsAll Things Open
Presented at All Things Open 2023
Presented by Christine Abernathy - F5, Inc.
Title: Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Abstract: As Artificial Intelligence (AI) and Machine Learning (ML) applications continue to surge, it is crucial to be aware of and address the security risks associated with these technologies. In this talk, Christine will explore AI/ML failure modes, threats, and mitigation strategies. She will guide you through the fundamentals of ML models then introduce you to key security challenges such as adversarial attacks, data poisoning, model inversion, model stealing, and membership inference attacks, using real-world examples to demonstrate their potential impact.
Christine will also discuss privacy and ethical considerations in ML, touching upon techniques like federated learning and shedding light on the current regulatory landscape surrounding security risks. If you are developing AI/ML applications or incorporating AI/ML components into your technology stack, check out this talk. You will walk away with a deeper understanding of the current AI/ML security landscape and a toolkit to help you address these risks, enabling you to build safer, more secure, and privacy-aware applications.
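To make the adversarial-attack threat concrete, here is a toy, NumPy-only sketch (not from the talk) of the fast gradient sign method against a hand-made logistic classifier: a small, bounded perturbation in the direction of the loss gradient flips a confident prediction.

```python
# Toy fast gradient sign method (FGSM) against a linear classifier.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.5, 0.5])      # hypothetical trained weights
b = -0.2
x = np.array([0.4, -0.3, 0.8])      # a benign input, confidently classified
y = 1.0                             # its true label

p = sigmoid(w @ x + b)
grad_x = (p - y) * w                # d(cross-entropy)/dx for logistic regression
x_adv = x + 0.5 * np.sign(grad_x)   # FGSM step with epsilon = 0.5

print("clean prediction:       %.3f" % p)                      # ~0.81 (class 1)
print("adversarial prediction: %.3f" % sigmoid(w @ x_adv + b)) # ~0.37 (flipped)
```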
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...All Things Open
Presented at All Things Open 2023
Presented by Carlos Santana - AWS
Title: Securing Cloud Resources Deployed with Control Planes on Kubernetes using Governance and Policy as Code
Abstract: Are you concerned about the security of your cloud resources deployed on Kubernetes? Are you struggling to ensure compliance with regulatory requirements while managing your cloud infrastructure? If yes, then this talk is for you!
We will discuss how to secure cloud resources deployed with Crossplane on Kubernetes using Governance and Policy as Code. We will explore how to leverage Governance and Policy as Code tools like Rego, Kyverno, and OPA to ensure security and compliance.
By the end of this talk, you will have a better understanding of the challenges associated with securing cloud resources deployed with Crossplane or ACK on Kubernetes, the importance of Governance and Policy as Code in ensuring security and compliance, and why it is critical to use open source and open standards in these technologies.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also seen, many times, developers implement front-end features by just following the standard rules of a framework and thinking that this is enough to launch the project successfully - and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
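As a taste of the Python binding mentioned above, the short sketch below loads a bundled test network and runs an AC power flow with pypowsybl; function names reflect recent releases and may differ slightly by version.

```python
# Minimal pypowsybl sketch: load a test network and run an AC power flow.
import pypowsybl as pp

network = pp.network.create_ieee14()      # bundled IEEE 14-bus test network
results = pp.loadflow.run_ac(network)     # run an AC power flow
print(results[0].status)                  # convergence status of the main component
print(network.get_buses().head())         # bus voltages/angles as a pandas DataFrame
```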
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
2. DATA WAREHOUSING VS BIG DATA
• Does Big Data replace Data Warehousing? Or do I need both?
• What’s the difference:
• Between the data flowing into a data warehouse vs big data tools?
• Between the ingestion processes and infrastructure?
• Data Lakes arrived with Big Data, so are they useful in Data Warehousing?
• How should I model my data in EDW?
• 3NF, Star Schema, same as my operational data stores?
• Data Vault 2.0
• Graph Databases
• What is an architecture that allows both to co-exist effectively?
5. DATA VAULT 2.0
COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE
• “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault
• Data loaded as-is from sources, no edits or cleanup
• Append-only to afford highest performance
• Agile & agnostic to changes in the operational store’s data model
• Essentially, a prescription for Layered Graph to Relational Mapping
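As a rough illustration of the loading rules in the bullets above (data as-is, append-only, every row stamped with its record source and load date), here is a minimal Python/SQLite sketch of a satellite load; the table and column names are illustrative, not prescribed by the method.

```python
# Append-only satellite load: rows are inserted as-is from the source,
# never updated, with record source and load date carried on every row.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_flight_forecast (
        flight_key    TEXT NOT NULL,   -- key of the parent hub row
        load_date     TEXT NOT NULL,   -- when the warehouse first saw this row
        record_source TEXT NOT NULL,   -- which system supplied it
        depart        TEXT,
        gate          TEXT,
        PRIMARY KEY (flight_key, load_date)
    )
""")

def load_as_is(rows, record_source):
    """Append-only insert: no edits, no cleanup of the source values."""
    load_date = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO sat_flight_forecast VALUES (?, ?, ?, ?, ?)",
        [(r["flight_key"], load_date, record_source, r["depart"], r["gate"])
         for r in rows],
    )

load_as_is([{"flight_key": "UA123|2018-10-11", "depart": "1:25PM", "gate": "B27"}],
           record_source="LGA")
print(conn.execute("SELECT * FROM sat_flight_forecast").fetchall())
```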
6. DATA WAREHOUSING & DATA VAULT 2.0
• 60’s, 70’s, 80’s
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Introduces Data Vault 2.0
7.
8. Source: “What are Graph Databases and Why should I care?”, by Dave Bechberger of Expero
14. [Slide diagram: a worked Data Vault example built from Hubs (Flight, Aircraft), Links (SERVICED_BY), and Satellites. Recoverable sample data:]
Flight satellites (Base / Dest / Forecast):
Record Source | Load Date  | Depart | Gate
LGA           | 2018-10-11 | 1:25PM | B27
CAE           | 2018-10-24 | 3:30PM | A14
SFO           | 2018-09-06 | 8:55PM | G19
RDU           | 2018-08-12 | 4:45PM | C22
SERVICED_BY link: Record Source = Airport CAE, Load Date = 2018-11-17, Source Id = 20181117-32-983
Aircraft satellites (Base / Service / FAA / NTSB):
Record Source | Load Date  | Model | Tail No
United        | 2017-02-11 | 767   | 1477
Delta         | 2015-11-04 | A6    | 2381
Alaska        | 2013-08-28 | 747   | 8312
Frontier      | 2016-07-19 | 182   | 1438
SERVICED_BY link: Record Source = United Airlines, Load Date = 2018-01-17, Source Id = 2412c
Satellites (Base / Dest / Manifest):
Record Source | Load Date  | Begin      | End
United        | 2017-02-11 | 2017-04-23 | 2017-09-23
Delta         | 2015-11-04 | 2015-12-01 | 2017-04-22
Alaska        | 2013-08-28 | 2013-09-14 | 2016-05-04
Frontier      | 2016-07-19 | 2016-08-02 | 2018-04-11
Record Source = United Airlines, Load Date = 2018-09-17
[Legend: Hubs, Links, Satellites]
15. • Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations - Mel Conway
16. [Slide diagram: the same sample data remodeled around more explicit business concepts - FLIGHT, Aircraft, Airport, and Airline structures - with the satellite rows (Record Source, Load Date, and the Depart/Gate, Model/Tail No, Begin/End attributes) carried over unchanged from slide 14. Legend: Hubs, Links, Satellites.]
18. • Modeled after self-organizing networks
• A Business Key identifies a key concept in business.
• They have a business meaning
• They are unique and have a very low propensity to change
• Business keys change only when the business changes
• Enables (forces) cross-source modeling
Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
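A common Data Vault 2.0 practice is to derive the hub key by hashing the normalized business key, which is what makes the cross-source modeling described above work: the same business key lands on the same hub row regardless of which system delivered it. A minimal sketch (illustrative names; MD5 chosen only for brevity):

```python
# Derive a deterministic hub key from the business key alone.
import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Surrogate key computed only from the (normalized) business key."""
    normalized = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same flight arriving from two different source systems lands on the
# same hub row, because the business key (not a source-generated id) drives it.
print(hub_hash_key("UA123", "2018-10-11"))
print(hub_hash_key("ua123 ", "2018-10-11"))   # normalization makes these equal
```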
30. DATA WAREHOUSING
• Deep Topic
• 60’s, 70’s, 80’s
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Introduces Data Vault 2.0
• Data Warehouse vs Operational Data Stores
• Data Warehouse as Version Control System
BIG DATA
• MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, GFS
• Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
• What exactly is “Big Data”?
33. ETL OR SERDE?
[Slide diagram: components include Client and User, a Serializer and Deserializer maintained as a single, version-locked source, event logs (Eventlog.e / Eventlog.d), Lambda functions, Kafka/Kinesis, the internet, S3/Hadoop, and time-series event-record analysis.]
No single answer, but convention over configuration won the day
Data Warehousing
---
60’s, 70’s, 80’s
E.F. Codd => 3NF
Bill Inmon invents Data Warehousing concept
Dr. Ralph Kimball popularizes Star Schema design
90’s, 00’s:
Dan Linstedt creates Data Vault Model @ DOD
2014:
Dan Introduces Data Vault 2.0
Data Warehouse vs Operational Data Stores
Data Warehouse as Version Control System
Big Data
-----
MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, GFS
Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
What exactly is “Big Data”?
Too close to the forest, forget to see the trees
Is the business intelligence scattered out in the field
Or centralized in the back office?
Actors in the system are intelligent?
Learn language, conjugate verbs, form new sentences
Serializer/Deserializer: Reusable package to be imported into a Lambda
Test suite that ensures Serializer / Deserializer agree on before/after result
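A minimal sketch of the last two notes: one shared module owns both sides of the serde contract, and a round-trip test keeps the serializer and deserializer agreeing on the before/after result before the package is imported into any Lambda. All names here are illustrative.

```python
# Shared serializer/deserializer with a round-trip test.
import json
from dataclasses import dataclass, asdict

@dataclass
class Event:
    flight: str
    gate: str
    depart: str

def serialize(event: Event) -> bytes:
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")

def deserialize(payload: bytes) -> Event:
    return Event(**json.loads(payload.decode("utf-8")))

def test_round_trip():
    before = Event(flight="UA123", gate="B27", depart="1:25PM")
    assert deserialize(serialize(before)) == before   # before/after must agree

if __name__ == "__main__":
    test_round_trip()
    print("serializer and deserializer agree")
```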