25/04/2019
Agenda
• Who we are
• Data Vault 2.0
• Pivotal Greenplum
• Reference cases
• Q&A
Value Proposition
• Focus on next-generation data integration
• (Enterprise) Open Source
• PostgreSQL & Pivotal Greenplum
• Data Integration Automation
• Data Vault 2.0
• Reference Architecture
• A to Z data integration projects
Data Vault 2.0
• Data modeling technique and methodology providing a historical data representation from multiple sources, designed to be resilient to environmental changes
• History
• Data Vault was originally conceived by Dan Linstedt in 1990
• Released in 2000
• Data Vault 2.0 arrived on the scene in 2013 and incorporates seamless integration of Big Data technologies along with methodology, architecture and best-practice implementations
• The Data Vault model consists of three basic table types:
• HUB: (natural) business key
• LINK: (natural) business relationship
• SATELLITE: all context, descriptive data and history
• Finally a standard
• Can be audited and certified
• From theory to implementation
• Bridging the gap between classic data warehouse & big data
• Architecture guidelines
• Project guidelines
• Since Data Vault 2.0 follows a well-prescribed set of rules, it can be automated
Example model (customer and customer class):
• H_CUSTOMER (hub): H_CUSTOMER_SID, Customer Code, Date/Time Stamp, Record Source
• H_CUST_CLASS (hub): H_CUST_CLASS_SID, Customer Class Code, Date/Time Stamp, Record Source
• L_CUST_CLASS (link): L_CUST_CLASS_SID, H_CUSTOMER_SID, H_CUST_CLASS_SID, Date/Time Stamp, Record Source
• S_CUSTOMER (satellite): H_CUSTOMER_SID, Date/Time Stamp, End Date/Time Stamp, Context A, Context B, Context C, Record Source
• S_CUST_CLASS (satellite): H_CUST_CLASS_SID, Date/Time Stamp, End Date/Time Stamp, Context A, Context B, Context C, Record Source
Data Vault 2.0
Decoupling structure from description
• Satellites contain all descriptive information
• Hubs and links form the structural backbone (see the DDL sketch below)
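As a rough illustration, the sketch below prints PostgreSQL/Greenplum-style DDL for the customer example above. The column names follow the diagram; the data types and the DV2.0-style MD5 hash keys are assumptions, not something the deck specifies.

```python
# Minimal sketch of the three Data Vault table types from the diagram.
# Data types and the CHAR(32) MD5 hash keys are assumptions.
DDL = """
CREATE TABLE h_customer (              -- HUB: one row per business key
    h_customer_sid   CHAR(32) PRIMARY KEY,
    customer_code    TEXT NOT NULL,    -- (natural) business key
    load_dts         TIMESTAMP NOT NULL,
    record_source    TEXT NOT NULL
);

CREATE TABLE l_cust_class (            -- LINK: one row per relationship
    l_cust_class_sid CHAR(32) PRIMARY KEY,
    h_customer_sid   CHAR(32) NOT NULL REFERENCES h_customer,
    h_cust_class_sid CHAR(32) NOT NULL,    -- references h_cust_class
    load_dts         TIMESTAMP NOT NULL,
    record_source    TEXT NOT NULL
);

CREATE TABLE s_customer (              -- SATELLITE: context + history
    h_customer_sid   CHAR(32) NOT NULL REFERENCES h_customer,
    load_dts         TIMESTAMP NOT NULL,    -- start of validity
    load_end_dts     TIMESTAMP,             -- open-ended = current row
    context_a        TEXT,
    context_b        TEXT,
    context_c        TEXT,
    record_source    TEXT NOT NULL,
    PRIMARY KEY (h_customer_sid, load_dts)
);
"""

if __name__ == "__main__":
    print(DDL)
```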
Reference Architecture & Approach
[Architecture diagram: data flows from the DATA SOURCES into the DATA INTEGRATION LAYER, which holds a Raw Data Vault and a Business Data Vault (hubs, links and satellites, plus Bridge and PIT tables) built by Data Vault Automation (Source Analysis, Generate, Deploy), and on to the PRESENTATION LAYER and ACCESS LAYER serving a DATA LAB, WEB SERVICES and APPLICATIONS; METADATA & DATA QUALITY span all layers.]
Business Value of the Structured Data Lake:
• Rapid time to deliver (2- or 3-week delivery cycles)
• Lower cost of maintenance and management by applying automation
• Dynamic structure adaptation
• Data is 100% traceable to its origin and fully represents the source data at any given point in time
• Spot business problems that were never visible previously
• All changes to source data are recorded
• Scale to hundreds of terabytes or petabytes
• Data Scientists no longer need to do data integration and can focus on delivering business value
• One single, integrated version of the facts usable by all Data Scientists
• Data Vault 2.0 is a methodology with a proven track record
Data sources shown in the diagram: Databases, Files, Confluent Schema Registry.
Why Data Vault 2.0?
• Supports the incremental build approach (extensible without impacting existing structures)
• Addition of new information domains / extensions on existing information domains without impact on:
• Existing Data Vault model
• Existing Data Marts
• Existing ETL scripts
• Existing source systems
• Existing Reporting & BI functions
[Diagram: a Phase 1 model of hubs (H), links (L) and satellites (S) is extended in Phase 2 with additional hubs, links and satellites, leaving the Phase 1 structures unchanged.]
Why Data Vault 2.0?
• Simplify & accelerate the build of an EDW: only a limited number of templates (Hub ELT, Satellite ELT, Link ELT) is needed, so the ELT code can be generated (see the sketch below)
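A minimal sketch of what "few templates, generated code" can mean in practice: one hub-load pattern rendered from metadata. The template shape, the metadata keys and the MD5 hash key are illustrative assumptions; the deck does not show the Foundation Accelerator's actual templates.

```python
# Sketch: one metadata-driven template covers every hub load.
# Template shape and metadata keys are illustrative assumptions.
HUB_TEMPLATE = """
INSERT INTO {hub} ({sid}, {bk}, load_dts, record_source)
SELECT DISTINCT md5(src.{src_bk}), src.{src_bk}, now(), '{source}'
FROM {staging} src
WHERE NOT EXISTS (                   -- load only new business keys
    SELECT 1 FROM {hub} h WHERE h.{bk} = src.{src_bk}
);
"""

def generate_hub_elt(meta: dict) -> str:
    """Render the hub-load template for one source-to-hub mapping."""
    return HUB_TEMPLATE.format(**meta)

print(generate_hub_elt({
    "hub": "h_customer", "sid": "h_customer_sid",
    "bk": "customer_code", "src_bk": "cust_code",
    "staging": "stg_crm_customer", "source": "CRM",
}))
```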
Why Data Vault 2.0?
• Scalability (Unlimited level of parallelism, no dependencies)
• No loading sequence
• No interdependencies
• All Hubs, Satellites and Links can be loaded in parallel
[Diagram: ETL process flow with Load HUB 1 … Load HUB X, Load SAT 1 … Load SAT Y, and Load LNK 1 … Load LNK Z all running in parallel; a sketch follows below.]
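Because no load depends on another, an orchestrator can fan all table loads out at once. A minimal sketch; the `run_load` body is a placeholder for executing the generated ELT statement:

```python
from concurrent.futures import ThreadPoolExecutor

def run_load(table: str) -> str:
    """Placeholder: execute the generated ELT statement for `table`."""
    return f"loaded {table}"

# Hubs, satellites and links: no loading sequence, no interdependencies.
tables = ["h_customer", "h_cust_class",
          "s_customer", "s_cust_class",
          "l_cust_class"]

with ThreadPoolExecutor(max_workers=len(tables)) as pool:
    for result in pool.map(run_load, tables):
        print(result)
```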
Why Data Vault 2.0?
• Traceable, auditable
• Audit IDs, creation dates, etc. are a standard part of the model
• Model changes are also recorded as change history, so full auditability of the model over time is possible
• Completeness (atomic, all historic data): a data recorder
• The standard approach is to store all data at the lowest level of granularity and to record the full history of change for all attributes
• Resilient to change
• New relationship to a new entity
• No impact on the existing structure
• A new hub (e.g., Product) and a link table with a satellite on the link
• Relationship cardinality change (from 1-1 to 1-N to N-M)
• No impact on the existing structure
• Because a relationship modeled as a link table is always stored in a structure that can hold an N-M relationship, this type of change has no impact
• New source for the same reference data (e.g., customer)
• No impact on the existing structure
• A new satellite on every hub for which the new source delivers data
• New attributes
• No impact on the existing structure
• A new satellite that contains the new attributes (see the sketch below)
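For instance, absorbing new source attributes is a purely additive change: one new satellite on the existing hub, nothing else touched. A hedged sketch reusing the conventions from the earlier DDL example (names and types are assumptions):

```python
# New attributes from the source land in a new satellite only;
# existing hubs, links, satellites and ELT remain untouched.
NEW_SATELLITE_DDL = """
CREATE TABLE s_customer_extra (
    h_customer_sid  CHAR(32) NOT NULL REFERENCES h_customer,
    load_dts        TIMESTAMP NOT NULL,
    load_end_dts    TIMESTAMP,
    new_attribute_1 TEXT,              -- the newly delivered attributes
    new_attribute_2 TEXT,
    record_source   TEXT NOT NULL,
    PRIMARY KEY (h_customer_sid, load_dts)
);
"""

print(NEW_SATELLITE_DDL)
```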
Data Vault 2.0 Automation
Dan Linstedt
“Businesses are looking to do things better, faster and cheaper, and you really can’t do that
without automating or generating your solutions in your data warehousing system.
In order for your information delivery team to be agile and provide the data downstream quicker, you
need a metadata-driven automation tool.”
Multiple Speed Approach
• The Raw and the Business Data Vault areas can be built at different speeds and by two separate teams.
• The Raw (source-based) Data Vault is a technical implementation based on the source systems and only requires a source analysis: a single version of the facts. The Foundation Accelerator 4.0 will be used to generate this layer.
• The Business Data Vault is:
• A business-based implementation that requires business, functional and technical analysis to understand the business requirements: a single or multiple versions of the truth
• The introduction of Bridge and PIT tables supports the implementation of the virtualized presentation layer (see the sketch below)
• The multiple speed approach enables better business, functional and technical analysis once the raw Data Vault data is already available.
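A rough sketch of why PIT (point-in-time) tables support a virtualized presentation layer: they pre-resolve, per hub key and snapshot date, which satellite row was current, so presentation views reduce to simple equi-joins. The customer tables reuse the earlier example; the `date_spine` table is an assumption.

```python
# Sketch: a PIT table for the customer hub and one satellite.
# A date_spine table (one row per snapshot date) is assumed to exist.
PIT_SQL = """
CREATE TABLE pit_customer AS
SELECT h.h_customer_sid,
       d.snapshot_date,
       (SELECT max(s.load_dts)
          FROM s_customer s
         WHERE s.h_customer_sid = h.h_customer_sid
           AND s.load_dts <= d.snapshot_date) AS s_customer_load_dts
FROM h_customer h
CROSS JOIN date_spine d;
-- Presentation views then equi-join s_customer on
-- (h_customer_sid, s_customer_load_dts): no range logic at query time.
"""

print(PIT_SQL)
```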
Pivotal Greenplum
• Launched in 2005
• Acquired by EMC in 2010
• Became part of Pivotal in 2013
• Massively Parallel Processing (MPP) RDBMS
• MPP shared-nothing architecture: each node is independent and self-sufficient, with its own CPU, disk and memory (see the sketch below)
• Data analytics platform
• Open source core based on PostgreSQL
• Over 1,000 person-years of R&D invested
• Hundreds of customers across 34 countries
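In day-to-day use, the shared-nothing design shows up in the DDL: every table declares a distribution key that determines which segment stores each row. A minimal sketch using the standard psycopg2 driver; the DSN and table are placeholders:

```python
import psycopg2

# Greenplum extends PostgreSQL DDL with DISTRIBUTED BY: rows are hashed
# on the key and spread evenly across the independent segment hosts.
ddl = """
CREATE TABLE stg_customer (
    customer_code TEXT NOT NULL,
    loaded_at     TIMESTAMP NOT NULL
) DISTRIBUTED BY (customer_code);
"""

with psycopg2.connect("host=gp-master dbname=dw") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(ddl)
```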
Pivotal Greenplum – Architecture
[Diagram: master hosts accept querying, loading and ELT work, and communicate over interconnect switches with the segment hosts, which store the segments and their mirrors.]
Pivotal Greenplum – Expansion
Expanding the cluster takes three steps while ELT keeps running:
1. Add new hosts: increases the compute power and storage.
2. Greenplum restart: unavailability of less than 1 minute.
3. Data redistribution: data is redistributed without closing the database (see the sketch below).
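Greenplum ships a gpexpand utility that drives these steps; below is a hedged sketch of orchestrating it. Treat the exact invocation and input file as assumptions and verify them against the documentation for your Greenplum version.

```python
import subprocess

# Phase 1: read the prepared input file describing the new hosts and
# segments, initialize them, and perform the brief restart.
# (Exact flags are an assumption; verify against your Greenplum docs.)
subprocess.run(["gpexpand", "-i", "new_segments_input_file"], check=True)

# Phase 2: run gpexpand again to redistribute existing tables across
# the enlarged cluster while the database stays online.
subprocess.run(["gpexpand"], check=True)
```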
Pivotal Greenplum – Mirroring (Normal Situation)
Each segment server runs a set of active primary segment instances (P) plus mirrors (M) of segments whose primaries live on other servers:
• Segment Server 1: P1 P2 P3 | M6 M8 M10
• Segment Server 2: P4 P5 P6 | M1 M9 M11
• Segment Server 3: P7 P8 P9 | M2 M4 M12
• Segment Server 4: P10 P11 P12 | M3 M5 M7
Pivotal Greenplum – Mirroring (Loss of 1 Host)
If Segment Server 1 fails, its primaries P1, P2 and P3 are taken over by their mirrors M1, M2 and M3 on the other servers, which join the set of active segment instances:
• Segment Server 1: down (P1 P2 P3 unavailable)
• Segment Server 2: P4 P5 P6 + M1 (acting as primary) | M9 M11
• Segment Server 3: P7 P8 P9 + M2 (acting as primary) | M4 M12
• Segment Server 4: P10 P11 P12 + M3 (acting as primary) | M5 M7
Pivotal Greenplum Runs Analytics Anywhere
• Public Clouds
• VMware Based Solutions
• Generic Bare Metal
Methodology
• Consistent
• Repeatable
• Pattern-based implementation
Architecture
• Multi-tier
• Scalable
Model
• Flexible
• Scalable (Big Data)
• Hub & Spoke
Enterprise Open Source
• PostgreSQL variants
• Talend
Vlaamse Milieu Maatschappij
• Installation of a Pivotal Greenplum platform
• Implementation of the Data Vault 2.0 data integration layer for self-service BI and Data Science
Agentschap Zorg & Gezondheid
• Pivotal Greenplum platform on AWS
• Implementation of the Data Vault 2.0 data integration layer for Data Science
• Correlation between air pollution and mortality
• Correlation between heat and mortality
Thank you for your attention!
Bart Gielen
Managing Partner
+32 479 81 68 83
bart.gielen@datasense.be
www.datasense.be
www.linkedin.com/in/bartgielen-datasense
