Data Vault vs Data Lake
Coach Fru Louis | www.frulouis.com
Tech|Career|Inspiration
Data Basics
Agenda
@Coachfrulouis
01. What are Data Lakes?
02. What are Data Vaults?
03. Friends or Foes?
04. Possibilities…
Data Lakes: Definition & Characteristics
• Semi-structured, unstructured and raw data
• Schema on read
• Low-cost storage
• Agile and easy to reconfigure
• Suited to data scientists and experimentation
• Democratizes data
• Supports all data formats
• Schema flexibility
• Advanced analytics
• Scalability
[Diagram: Sources, E(xtract) T(ransform) L(oad), Data Lake, Consumers / Analysts / Reports / Data Scientists]
Workloads: data visualization, data filtering, machine learning, dashboards, batch processing, interactive processing.
Tools: Hadoop, HDFS, S3, Spark, Databricks, etc.; R, Pig, Solr, Hive, Presto, Tableau.
Definition: A data lake is a system or repository of data stored in its natural/raw format,
usually object blobs or files. A data lake is usually a single store of all enterprise data
including raw copies of source system data and transformed data used for tasks such as
reporting, visualization, advanced analytics and machine learning.
https://en.wikipedia.org/wiki/Data_lake
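To make "schema on read" concrete, here is a minimal Python sketch. It assumes the lake holds raw JSON lines; the field names, the sample records and the helper `read_with_schema` are all hypothetical and exist only for illustration. The point is that the raw data is stored untouched, and each consumer applies its own schema at the moment of reading.

```python
import json

# Raw events land in the lake exactly as the source emitted them (hypothetical sample data).
RAW_LINES = [
    '{"customer_id": "C1", "amount": "19.90", "channel": "web"}',
    '{"customer_id": "C2", "amount": "5", "coupon": "SPRING"}',
    '{"customer_id": "C1", "amount": "7.50"}',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading: keep the named fields, cast them, ignore the rest."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Two consumers read the same raw data through different schemas.
sales_schema = {"customer_id": str, "amount": float}
marketing_schema = {"customer_id": str, "coupon": str}

sales = list(read_with_schema(RAW_LINES, sales_schema))
marketing = list(read_with_schema(RAW_LINES, marketing_schema))

# A simple aggregate built on the sales view of the raw data.
total_by_customer = {}
for row in sales:
    key = row["customer_id"]
    total_by_customer[key] = total_by_customer.get(key, 0.0) + row["amount"]
```

Nothing about the stored records changes when a new consumer arrives with a new schema, which is what keeps a lake cheap to reconfigure.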
Data Lakes: History & Evolution
2006: Amazon AWS Launches
2008: Yahoo Open Sources Hadoop **
2009: Cloudera Forms
2009: AWS Elastic MapReduce
2010: Apache Hive release
2010: James Dixon coins the term Data Lake
2011: Hortonworks forms
2015: Snowflake released on AWS
2015: Hive and Presto released on AWS
2017: AWS Athena released
Data Vault (Modeling): Definition & Characteristics
Data Vault modeling is a database modeling method that is designed to provide long-term
historical storage of data coming in from multiple operational systems. It is also a method
of looking at historical data that deals with issues such as auditing, tracing of data, loading
speed and resilience to change as well as emphasizing the need to trace where all the
data in the database came from.
[Diagram: Sources, E(xtract) T(ransform) L(oad), Data Lake, Data Vault Modelling / Harmonize, Consumers / Analysts / Reports / Data Scientists]
Data Lake (Data Science / Exploration): semi-structured, unstructured and raw data; schema on read; low-cost storage; agile and easy to reconfigure; suited to data scientists and experimentation.
Data Lake tools: Hadoop, HDFS, S3, Spark, Databricks, etc.; R, Pig, Solr, Hive, Presto, Tableau.
Data Vault Modelling / Harmonize platforms: Hive, Snowflake, BigQuery, Redshift, Oracle, Synapse, etc.
Workloads: data visualization, data filtering, machine learning, dashboards, batch processing, interactive processing.
https://en.wikipedia.org/wiki/Data_vault_modeling
Data Vault (Modeling): History & Evolution
1970s: E.F. Codd => 3NF
Bill Inmon invents the Data Warehouse
Dr. R. Kimball champions star schema
1990s: Conceived by Dan Linstedt
2000: DV 1.0 released into the public domain
2014: DV 2.0 announced
Data Vault (Modelling)
Data Vault modelling is a technique used to store source data at a more granular level. Generally, the data is not changed in any way, other than adding load-date keys to track changes.
• Hubs: These contain the business keys and any metadata. Nothing descriptive is written to a Hub.
• Links: These connect two or more Hubs together.
• Sats (Satellites): These are the complete source tables that contain descriptive information and time attributes, so we can track changes and do point-in-time analysis.
Compared with a 3NF model:
1) Instead of each master table in 3NF, we add a Hub and a Satellite.
2) Instead of the transactional table, we add a Link table and a Satellite.
3) Instead of the joins between master tables, we add Link tables.
http://bukhantsov.org/2012/04/what-is-data-vault/
[Figure: Dimensional Model compared with Data Vault Model]
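The Hub / Link / Satellite split can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table and column names (`hub_customer`, `link_customer_product`, `sat_customer`, `record_source`), the sample rows and the helper `city_as_of` are hypothetical and do not come from the slides or from any Data Vault standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: business keys plus load metadata, nothing descriptive.
CREATE TABLE hub_customer (
    customer_key  TEXT PRIMARY KEY,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_key   TEXT PRIMARY KEY,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Link: connects two Hubs (here, which customer bought which product).
CREATE TABLE link_customer_product (
    customer_key  TEXT NOT NULL REFERENCES hub_customer(customer_key),
    product_key   TEXT NOT NULL REFERENCES hub_product(product_key),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_key, product_key)
);
-- Satellite: descriptive attributes, one row per change, keyed by load date.
CREATE TABLE sat_customer (
    customer_key  TEXT NOT NULL REFERENCES hub_customer(customer_key),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_key, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES ('C1', '2024-01-01', 'crm')")
conn.execute("INSERT INTO hub_product VALUES ('P1', '2024-01-01', 'catalog')")
conn.execute("INSERT INTO link_customer_product VALUES ('C1', 'P1', '2024-01-05', 'shop')")
# The customer moves: a new Satellite row is added, the old one is kept.
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [("C1", "2024-01-01", "Ada", "Berlin"), ("C1", "2024-03-01", "Ada", "Paris")],
)


def city_as_of(connection, customer_key, as_of):
    """Point-in-time lookup: latest Satellite row loaded on or before `as_of`."""
    row = connection.execute(
        "SELECT city FROM sat_customer "
        "WHERE customer_key = ? AND load_date <= ? "
        "ORDER BY load_date DESC LIMIT 1",
        (customer_key, as_of),
    ).fetchone()
    return row[0] if row else None
```

Because changes are appended rather than overwritten, calling `city_as_of(conn, "C1", "2024-02-15")` reads the row loaded in January, while a date after the March load reads the newer row.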
Verdict: Data Vault vs Data Lakes?
Verdict: This comparison is a misnomer. Data Vaults don't compete with Data Lakes. DV complements Data Lakes for better analytics, i.e. Data Lakes + Data Vault (Modelling).
Data warehousing modeling techniques: Data Vault Modelling, Dimensional Modelling, 3NF, others.
[Diagram: Sources, E(xtract) T(ransform) L(oad), Data Lake (Data Science / Exploration), Modelling / Harmonize, Consumption by Consumers / Analysts / Reports / Data Scientists]
Data Lake tools: Hadoop, HDFS, S3, Spark, Databricks, etc.; R, Pig, Solr, Hive, Presto, Tableau.
Modelling / Harmonize (RDBMS): Hive, Snowflake, BigQuery, Redshift, Oracle, Synapse, etc.
Workloads: data visualization, data filtering, machine learning, dashboards, batch processing, interactive processing.
Thanks
www.frulouis.com
