Data Vault Automation at de Bijenkorf
PRESENTED BY
ROB WINTERS
ANDREI SCORUS
Presentation agenda
◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions
About the presenters
Rob Winters
Head of Data Technology, the Bijenkorf
Project role:
◦ Project Lead
◦ Systems architect and administrator
◦ Data modeler
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
◦ Joined project September 2014
Andrei Scorus
BI Consultant, Incentro
Project role:
◦ Main ETL Developer
◦ ETL Developer
◦ Modeling support
◦ Source system expert
◦ Joined project November 2014
Project objectives
◦ Information requirements
◦ Have one place as the source for all reports
◦ Security and privacy
◦ Information management
◦ Integrate with production
◦ Non-functional requirements
◦ System quality
◦ Extensibility
◦ Scalability
◦ Maintainability
◦ Security
◦ Flexibility
◦ Low Cost
Technical Requirements
• One environment to quickly generate customer insights
• Then feed those insights back to production
• Then measure the impact of those changes in near real time
Source system landscape
Source Type | Number of Sources | Examples | Load Frequency | Data Structure
Oracle DB | 2 | Virgo ERP | 2x/hour | Partial 3NF
MySQL | 3 | Product DB, Web Orders, DWH | 10x/hour | 3NF (Web Orders), improperly normalized
Event bus | 1 | Web/email events | 1x/minute | Tab delimited with JSON fields
Webhook | 1 | Transactional emails | 1x/minute | JSON
REST APIs | 5+ | GA, DotMailer | 1x/hour-1x/day | JSON
SOAP APIs | 5+ | AdWords, Pricing | 1x/day | XML
Architectural overview
Tools
AWS
◦ S3
◦ Kinesis
◦ ElastiCache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB
Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI
Other
◦ HP Vertica
◦ Tableau
◦ GitHub
◦ RStudio Server
DWH internal architecture
• Traditional three tier DWH
• ODS generated automatically from staging
• Ops mart reflects data in original source form
• Helps offload queries from source systems
• Business marts materialized exclusively from vault
Bijenkorf Data Vault overview
Data volumes
• ~1 TB base volume
• 10-12 GB daily
• ~250 source tables
Aligned to Data Vault 2.0
• Hash keys
• Hashes used for CDC
• Parallel loading
• Maximum utilization of available resources
• Data loaded into the vault unchanged
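The hash key and hash-based CDC ideas above can be sketched in a few lines of Python. This is only an illustration: the slides do not say which hash function, delimiter, or key normalization was used, so MD5, the "||" separator, and the trim/upper-case step are all assumptions, as are the function names.

```python
import hashlib

DELIMITER = "||"  # assumed separator between key parts / attributes


def hash_key(*business_keys: str) -> str:
    """Hash key for a hub or link, built from normalized business keys (assumed: trim + upper-case)."""
    normalized = DELIMITER.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def hash_diff(attributes: dict) -> str:
    """Hash over all descriptive attributes in a fixed (sorted) column order."""
    ordered = DELIMITER.join(
        "" if attributes[col] is None else str(attributes[col])
        for col in sorted(attributes)
    )
    return hashlib.md5(ordered.encode("utf-8")).hexdigest()


def has_changed(current_hash_diff: str, incoming: dict) -> bool:
    """CDC check: a new satellite row is needed only when the hash differs."""
    return current_hash_diff != hash_diff(incoming)
```

Because the keys are computed from the data itself, hubs, links, and satellites can be loaded in parallel without looking up sequence values first, which is the usual argument for hash keys in Data Vault 2.0.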
Some statistics
18 hubs
• 34 loading scripts
27 links
• 43 loading scripts
39 satellites
• 43 loading scripts
13 reference tables
• 1 script per table
Model contains
• Sales transactions
• Customer and corporate locations
• Customers
• Products
• Payment methods
• E-mail
• Phone
• Product grouping
• Campaigns
• deBijenkorf card
• Social media
Excluded from the vault
◦ Event streams
◦ Server logs
◦ Unstructured data
Deep dive: Transactions in DV
[Data model diagram: transactions]
Deep dive: Customers in DV
[Data model diagram: customers, including a same-as link on customer]
Challenges encountered during data modeling
Challenge: Source issues
Issue details
• Source systems and original data unavailable for most information
• Data often transformed 2-4 times before access was available
• Business keys (e.g. SKU) typically replaced with sequences
Resolution
• Business keys rebuilt in staging prior to vault loading
Challenge: Modeling returns
Issue details
• Retail returns can appear in ERP in 1-3 ways across multiple tables with inconsistent keys
• Online returns appear as a state change on the original transaction and may or may not appear in ERP
Resolution
• Original model showed sale state on line item satellite
• Revised model recorded “negative sale” transactions and used a new link to connect to original sale when possible
Challenge: Fragmented knowledge
Issue details
• Information about the systems was being held by multiple people
• Documentation was out-of-date
Resolution
• Talking to as many people as possible and testing hypotheses on the data
Targeted benefits of DWH automation
Objective / Achievements
Speed of development
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step
Simplicity
• Five jobs handle all ETL processes across DWH
Traceability
• Every record/source file is traced in the database and every row is automatically identified by source file in ODS
Code simplification
• Replaced most common key definitions with dynamic variable replacement
File management
• Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
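One way to get archive locations that are sorted by source, table, and date is to encode those three values in the object key. The sketch below is an assumption about layout, not the project's actual scheme: the "archive" prefix, the lower-casing, and both function names are hypothetical.

```python
from datetime import date
from typing import Optional


def archive_key(source_system: str, table: str, load_date: date, file_name: str) -> str:
    """Build an object key of the form archive/<source>/<table>/<YYYY-MM-DD>/<file>."""
    return "/".join(
        ["archive", source_system.lower(), table.lower(), load_date.isoformat(), file_name]
    )


def replay_prefix(source_system: str, table: Optional[str] = None,
                  load_date: Optional[date] = None) -> str:
    """Key prefix that selects everything to replay: a whole source, one table, or one day."""
    parts = ["archive", source_system.lower()]
    if table is not None:
        parts.append(table.lower())
        if load_date is not None:
            parts.append(load_date.isoformat())
    return "/".join(parts) + "/"
```

With a layout like this, replaying a source system or a period reduces to listing the objects under one prefix and feeding them back through the normal loader.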
Source loading automation
◦ Design of loader focused on process abstraction, traceability, and minimization of “moving parts”
◦ Final process consisted of two base jobs working in tandem: one for generating incremental extracts from source systems, one for loading flat files from all sources to staging tables
◦ Replication was desired but rejected due to limited access to source systems
1. Source tables duplicated in staging with addition of loadTs and sourceFile columns
2. Metadata for source file added
3. Loader automatically generates ODS, begins tracking source files for duplication and data quality
4. Query generator automatically executes full duplication on first execution and incrementals afterward
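-- Staging copy of the source table, extended with the two tracking columns
CREATE TABLE stg_oms.customer
(
    customerId int
  , customerName varchar(500)
  , customerAddress varchar(5000)
  , loadTs timestamp NOT NULL        -- load timestamp of the row
  , sourceFile varchar(255) NOT NULL -- source file the row was loaded from
)
ORDER BY customerId
PARTITION BY date(loadTs)
;

-- Register the file name pattern, delimiter, and null marker for the new table
INSERT INTO meta.source_to_stg_mapping
(targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
VALUES
('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
;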
Example: Add additional table from existing source
Workflow of source integration
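A mapping row like the one above is enough for a generic loader to route a file to its staging table. The sketch below shows the idea in Python under several assumptions: the pattern is matched as a plain substring of the file name, 'TAB' maps to a tab character, and the function names and the in-memory MAPPINGS list are hypothetical stand-ins for the metadata table.

```python
import csv
import io
from datetime import datetime

# In-memory stand-in for meta.source_to_stg_mapping (same columns as the INSERT above).
MAPPINGS = [
    {"targetSchema": "stg_oms", "targetTable": "customer", "sourceSystem": "OMS",
     "fileNamePattern": "OMS_CUSTOMER", "delimiter": "TAB", "nullField": "NULL"},
]

# Assumed translation of delimiter names to characters.
DELIMITERS = {"TAB": "\t", "COMMA": ",", "PIPE": "|"}


def find_mapping(file_name, mappings=MAPPINGS):
    """Return the first mapping whose pattern occurs in the file name (assumed matching rule)."""
    for mapping in mappings:
        if mapping["fileNamePattern"] in file_name:
            return mapping
    return None


def stage_rows(file_name, content, load_ts, mappings=MAPPINGS):
    """Parse a flat file and append loadTs and sourceFile to every row."""
    mapping = find_mapping(file_name, mappings)
    if mapping is None:
        raise ValueError(f"no staging mapping for {file_name}")
    reader = csv.reader(io.StringIO(content), delimiter=DELIMITERS[mapping["delimiter"]])
    target = f'{mapping["targetSchema"]}.{mapping["targetTable"]}'
    rows = []
    for record in reader:
        values = [None if v == mapping["nullField"] else v for v in record]
        rows.append(values + [load_ts, file_name])
    return target, rows
```

Adding another table from the same source then needs no new loader code, only a new staging table and one more mapping row.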
Vault loading automation
1. All staging tables checked for changes
• New sources automatically added
• Last change epoch based on load stamps, advanced each time all dependencies execute successfully
2. List of dependent vault loads identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized
3. Loads planned in hub, link, sat order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order
4. Loads executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages
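The first three stages of this flow can be sketched as follows. This is a simplified illustration, not the project's scheduler: the job definitions, table names, and function names are hypothetical, and it assumes a load becomes due as soon as any one of its declared staging dependencies has changed.

```python
from datetime import datetime

# Hub loads run before link loads, which run before satellite loads.
LOAD_ORDER = {"hub": 0, "link": 1, "sat": 2}

# Hypothetical vault load jobs with dependencies declared at job creation.
JOBS = [
    {"name": "sat_customer_details", "type": "sat", "depends_on": {"stg_oms.customer"}},
    {"name": "hub_customer", "type": "hub", "depends_on": {"stg_oms.customer"}},
    {"name": "link_customer_order", "type": "link",
     "depends_on": {"stg_oms.customer", "stg_oms.orders"}},
    {"name": "hub_product", "type": "hub", "depends_on": {"stg_pim.product"}},
]


def changed_tables(latest_load_ts, last_epoch):
    """Staging tables whose newest load stamp is later than the last change epoch."""
    return {table for table, ts in latest_load_ts.items() if ts > last_epoch}


def plan_loads(changed, jobs=JOBS):
    """Select jobs that depend on a changed table and order them hub, link, sat."""
    due = [job for job in jobs if job["depends_on"] & set(changed)]
    due.sort(key=lambda job: (LOAD_ORDER[job["type"]], job["name"]))
    return [job["name"] for job in due]
```

In this sketch the epoch would only be advanced after every planned load has finished successfully, so a failed run is retried from the same point on the next cycle.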
◦ Loader is fully metadata driven with focus on horizontal scalability and management simplicity
◦ To support speed of development and performance, variable-driven SQL templates are used throughout
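A variable-driven SQL template can be as simple as a string with placeholders that the loader fills in per table. The sketch below uses Python's standard string.Template; the template text, the variable names, and the hub load logic are illustrative assumptions and not the templates used in the project.

```python
from string import Template

# Hypothetical hub load template; every $name is filled in from metadata at run time.
HUB_TEMPLATE = Template("""
INSERT INTO $vault_schema.$hub_table ($hash_key, $business_key, loadTs)
SELECT MD5(s.$business_key::varchar), s.$business_key, MIN(s.loadTs)
FROM $staging_table s
WHERE s.loadTs > '$last_epoch'
  AND NOT EXISTS (
      SELECT 1 FROM $vault_schema.$hub_table h
      WHERE h.$business_key = s.$business_key
  )
GROUP BY s.$business_key
""")


def template_variables(template):
    """Identify every variable name used in a template."""
    names = set()
    for match in Template.pattern.finditer(template.template):
        name = match.group("named") or match.group("braced")
        if name:
            names.add(name)
    return names


def render(template, variables):
    """Replace all variables; substitute() raises KeyError if one is missing."""
    return template.substitute(variables).strip()
```

Using substitute() rather than safe_substitute() is a deliberate choice here: a load with a missing variable fails immediately instead of sending half-rendered SQL to the database.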
Design goals for mart loading automation
Requirement | Solution | Benefit
Simple, standardized models | Metadata-driven Pentaho PDI | Easy development using parameters and variables
Easily extensible | Plugin framework | Rapid integration of new functionality
Rapid new job development | Recycle standardized jobs and transformations | Limited moving parts, easy modification
Low administration overhead | Leverage built-in logging and tracking | Easily integrated mart loading reporting with other ETL reports
Data Information mart automation flow
1. Retrieve commands
• Each dimension and fact is processed independently
2. Get dependencies
• Based on defined transformation, get all related vault tables: links, satellites or hubs
3. Retrieve changed data
• From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
• Store the data in the database until further processing
4. Execute transformations
• Multiple Pentaho transformations can be processed per command using the data captured in previous steps
5. Maintenance
• Logging happens throughout the whole process
• Cleanup after all commands have been processed
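The "retrieve changed data" step can be illustrated with a small Python sketch. It assumes each related vault table exposes a key and a load timestamp per row; the in-memory structure and the function name are hypothetical, and the real framework does this inside the database through Pentaho transformations.

```python
from datetime import datetime


def changed_keys(related_tables, last_update):
    """Collect the unique keys that changed in any related vault table since the last mart update.

    related_tables maps a vault table name to a list of (key, load_ts) rows.
    """
    keys = set()
    for rows in related_tables.values():
        for key, load_ts in rows:
            if load_ts > last_update:
                keys.add(key)
    return keys
```

Only the facts or dimension rows belonging to these keys then need to be rebuilt, which keeps each mart refresh proportional to the amount of change rather than to the size of the vault.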
Primary uses of Bijenkorf DWH
Customer analysis
• Provided first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed for segmentation of customers based on combination of in-store and online activity
Personalization
• DV drives recommendation engine and customer recommendations (updated nightly)
• Data pipeline supports near real time updating of customer recommendations based on web activity
Business intelligence
• DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI
Biggest drivers of success
AWS infrastructure
Cost: Entire infrastructure for less than one server in the data center
Toolset: Most services available off the shelf, minimizing administration
Freedom: No dependency on IT for development support
Scalability: Systems automatically scaled to match DWH demands
Automation
Speed: Enormous time savings after initial investment
Simplicity: Able to run and monitor 40k+ queries per day with minimal effort
Auditability: Enforced tracking and archiving without developer involvement
PDI framework
Ease of use: Adding new commands takes at most 45 minutes
Agile: Building the framework took 1 day
Low profile: Average memory usage of 250MB
Biggest mistakes along the way
Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation/models, which were rarely accurate
• Current users of sources should have been engaged earlier to explain undocumented caveats
Late utilization of templates and variables
• Variables were utilized late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates will significantly reduce development time in the mid/long run
Aggressive overextension of resources
• We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
• We have shifted focus to continuous release of new information rather than waiting for completeness
Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cautious with design automation!
◦ Automation can enormously simplify/accelerate data warehousing. Don’t be afraid to roll your own
◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
◦ Cloud based architecture based on column store DBs is extremely scalable, cheap, and highly performant
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!
Rob Winters
WintersRD@gmail.com
Andrei Scorus
andrei.scorus@incentro.com

ch8_multiplexing cs553 st07 slide share ss
MinThetLwin1
 
M44.pdf dairy management farm report of an
M44.pdf dairy management farm report of anM44.pdf dairy management farm report of an
M44.pdf dairy management farm report of an
ManjuBv2
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
Joel Ngushwai
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
satpalsheravatmumbai
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
revolutionary575
 
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy DsouzaOpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
ginni singh$A17
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
AnujaGaikwad28
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
vrvipin164
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
zyqedad
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
huseindihon
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
bhupeshkumar0889
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
harendmgr
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
palanisamyiiiier
 
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDeliveryBDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
erynsouthern
 

Recently uploaded (20)

Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
 
M44.pdf dairy management farm report of an
M44.pdf dairy management farm report of anM44.pdf dairy management farm report of an
M44.pdf dairy management farm report of an
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy DsouzaOpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDeliveryBDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
 

Data Vault Automation at the Bijenkorf

  • 1. Data Vault Automation at de Bijenkorf PRESENTED BY ROB WINTERS ANDREI SCORUS
  • 2. Presentation agenda ◦ Project objectives ◦ Architectural overview ◦ The data warehouse data model ◦ Automation in the data warehouse ◦ Successes and failures ◦ Conclusions
  • 3. About the presenters
      Rob Winters
      Head of Data Technology, the Bijenkorf
      Project role:
      ◦ Project Lead
      ◦ Systems architect and administrator
      ◦ Data modeler
      ◦ Developer (ETL, predictive models, reports)
      ◦ Stakeholder manager
      ◦ Joined project September 2014
      Andrei Scorus
      BI Consultant, Incentro
      Project role:
      ◦ Main ETL Developer
      ◦ ETL Developer
      ◦ Modeling support
      ◦ Source system expert
      ◦ Joined project November 2014
  • 4. Project objectives
      ◦ Information requirements
      ◦ Have one place as the source for all reports
      ◦ Security and privacy
      ◦ Information management
      ◦ Integrate with production
      ◦ Non-functional requirements
      ◦ System quality
      ◦ Extensibility
      ◦ Scalability
      ◦ Maintainability
      ◦ Security
      ◦ Flexibility
      ◦ Low Cost
      Technical Requirements
      • One environment to quickly generate customer insights
      • Then feed those insights back to production
      • Then measure the impact of those changes in near real time
  • 5. Source system landscape
      Source Type | Number of Sources | Examples | Load Frequency | Data Structure
      Oracle DB | 2 | Virgo ERP | 2x/hour | Partial 3NF
      MySQL | 3 | Product DB, Web Orders, DWH | 10x/hour | 3NF (Web Orders), improperly normalized
      Event bus | 1 | Web/email events | 1x/minute | Tab delimited with JSON fields
      Webhook | 1 | Transactional Emails | 1x/minute | JSON
      REST APIs | 5+ | GA, DotMailer | 1x/hour-1x/day | JSON
      SOAP APIs | 5+ | AdWords, Pricing | 1x/day | XML
  • 6. Architectural overview
      Tools
      AWS
      ◦ S3
      ◦ Kinesis
      ◦ Elasticache
      ◦ Elastic Beanstalk
      ◦ EC2
      ◦ DynamoDB
      Open Source
      ◦ Snowplow Event Tracker
      ◦ Rundeck Scheduler
      ◦ Jenkins Continuous Integration
      ◦ Pentaho PDI
      Other
      ◦ HP Vertica
      ◦ Tableau
      ◦ Github
      ◦ RStudio Server
  • 7. DWH internal architecture • Traditional three tier DWH • ODS generated automatically from staging • Ops mart reflects data in original source form • Helps offload queries from source systems • Business marts materialized exclusively from vault
  • 8. Bijenkorf Data Vault overview
      Data volumes
      • ~1 TB base volume
      • 10-12 GB daily
      • ~250 source tables
      Aligned to Data Vault 2.0
      • Hash keys
      • Hashes used for CDC
      • Parallel loading
      • Maximum utilization of available resources
      • Data loaded unchanged into the vault
      Some statistics
      • 18 hubs: 34 loading scripts
      • 27 links: 43 loading scripts
      • 39 satellites: 43 loading scripts
      • 13 reference tables: 1 script per table
      Model contains
      • Sales transactions
      • Customer and corporate locations
      • Customers
      • Products
      • Payment methods
      • E-mail
      • Phone
      • Product grouping
      • Campaigns
      • de Bijenkorf card
      • Social media
      Excluded from the vault
      ◦ Event streams
      ◦ Server logs
      ◦ Unstructured data
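The slide lists hash keys and hash-based CDC without showing how they are computed. The following is a minimal Python sketch of the general Data Vault 2.0 idea, not the presenters' implementation: the choice of MD5, the `||` delimiter, the trim/uppercase normalization, and the function names `hash_key`, `hash_diff`, and `has_changed` are all assumptions made for illustration.

```python
import hashlib


def hash_key(*business_keys: str) -> str:
    """Hub or link hash key: MD5 over normalized, delimited business keys.

    Assumption: keys are trimmed and uppercased so that formatting
    differences between sources do not produce different hashes.
    """
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def hash_diff(attributes: dict) -> str:
    """Satellite change hash over attribute values in a fixed column order."""
    payload = "||".join(
        "" if attributes[col] is None else str(attributes[col])
        for col in sorted(attributes)  # sorted names give a stable order
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def has_changed(stored_diff: str, incoming: dict) -> bool:
    """CDC check: a new satellite row is needed only if the hash differs."""
    return stored_diff != hash_diff(incoming)
```

Comparing one stored hash per row instead of every attribute is what makes hash-based CDC cheap: the loader only has to read the latest `hash_diff` value of each satellite row.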
  • 9. Deep dive: Transactions in DV •Transactions
  • 10. Deep dive: Customers in DV •Same as link on customer
  • 11. Challenges encountered during data modeling
      Challenge: Source issues
      Issue details
      • Source systems and original data unavailable for most information
      • Data often transformed 2-4 times before access was available
      • Business keys (ex. SKU) typically replaced with sequences
      Resolution
      • Business keys rebuilt in staging prior to vault loading
      Challenge: Modeling returns
      Issue details
      • Retail returns can appear in ERP in 1-3 ways across multiple tables with inconsistent keys
      • Online returns appear as a state change on original transaction and may/may not appear in ERP
      Resolution
      • Original model showed sale state on line item satellite
      • Revised model recorded “negative sale” transactions and used a new link to connect to original sale when possible
      Challenge: Fragmented knowledge
      Issue details
      • Information about the systems was being held by multiple people
      • Documentation was out-of-date
      Resolution
      • Talking to as many people as possible and testing hypotheses on the data
  • 12. Targeted benefits of DWH automation
      Objective: Speed of development
      • Integration of new sources or data from existing sources takes 1-2 steps
      • Adding a new vault dependency takes one step
      Objective: Simplicity
      • Five jobs handle all ETL processes across DWH
      Objective: Traceability
      • Every record/source file is traced in the database and every row automatically identified by source file in ODS
      Objective: Code simplification
      • Replaced most common key definitions with dynamic variable replacement
      Objective: File management
      • Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
      • Entire source systems, periods, etc. can be replayed in minutes
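Archiving files "sorted by source, table, and date" suggests a predictable key layout, which is also what makes replaying a period quick. The sketch below shows one way such a layout could work. The `source/table/yyyy/mm/dd/file` pattern and the names `archive_key` and `replay_candidates` are hypothetical; the slides do not give the actual S3 structure.

```python
from datetime import date


def archive_key(source_system: str, table: str, load_date: date, file_name: str) -> str:
    """Build an archive key of the assumed form source/table/yyyy/mm/dd/file."""
    return f"{source_system.lower()}/{table.lower()}/{load_date:%Y/%m/%d}/{file_name}"


def replay_candidates(keys, source_system: str, start: date, end: date):
    """Select archived files of one source system inside a date window."""
    prefix = source_system.lower() + "/"
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        # Key layout: source / table / yyyy / mm / dd / file name
        _, _, yyyy, mm, dd, _ = key.split("/", 5)
        if start <= date(int(yyyy), int(mm), int(dd)) <= end:
            selected.append(key)
    return sorted(selected)
```

Because the date is part of the key, selecting everything for a source system and period is a prefix-and-range filter rather than a lookup in a separate catalog.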
  • 13. Source loading automation
      o Design of loader focused on process abstraction, traceability, and minimization of “moving parts”
      o Final process consisted of two base jobs working in tandem: one for generating incremental extracts from source systems, one for loading flat files from all sources to staging tables
      o Replication was desired but rejected due to limited access to source systems
      Workflow of source integration
      • Source tables duplicated in staging with addition of loadTs and sourceFile columns
      • Metadata for source file added
      • Loader automatically generates ODS, begins tracking source files for duplication and data quality
      • Query generator automatically executes full duplication on first execution and incrementals afterward
      Example: Add additional table from existing source
          CREATE TABLE stg_oms.customer (
              customerId int
            , customerName varchar(500)
            , customerAddress varchar(5000)
            , loadTs timestamp NOT NULL
            , sourceFile varchar(255) NOT NULL
          )
          ORDER BY customerId
          PARTITION BY date(loadTs)
          ;
          INSERT INTO meta.source_to_stg_mapping
              (targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
          VALUES
              ('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
          ;
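The last workflow step, a full extract on the first run and incrementals afterward, can be illustrated with a short Python sketch. It is written under assumptions: the slides do not say how changes are detected, so the modification-timestamp column and the names `build_extract_query`, `change_column`, and `last_extracted` are hypothetical.

```python
from typing import Optional


def build_extract_query(table: str, change_column: str,
                        last_extracted: Optional[str]) -> str:
    """Return a full extract on the first run and an incremental one afterward.

    Assumption: the source table has a column that records when a row last
    changed. String formatting is used only to keep the sketch short; real
    code should pass the timestamp as a bind parameter.
    """
    base = f"SELECT * FROM {table}"
    if last_extracted is None:
        # No previous extract recorded: copy the whole table.
        return base
    return f"{base} WHERE {change_column} > '{last_extracted}'"


# First execution (nothing extracted yet), then a later incremental run.
first_run = build_extract_query("oms.customer", "modified_at", None)
later_run = build_extract_query("oms.customer", "modified_at", "2015-03-01 00:00:00")
```

The only state this approach needs per table is the timestamp of the last successful extract, which fits a metadata-driven loader such as the one described on the slide.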
  • 14. Vault loading automation
      All Staging Tables Checked for Changes
      • New sources automatically added
      • Last change epoch based on load stamps, advanced each time all dependencies execute successfully
      List of Dependent Vault Loads Identified
      • Dependencies declared at time of job creation
      • Load prioritization possible but not utilized
      Loads Planned in Hub, Link, Sat Order
      • Jobs parallelized across tables but serialized per job
      • Dynamic job queueing ensures appropriate execution order
      Loads Executed
      • Variables automatically identified and replaced
      • Each load records performance statistics and error messages
      o Loader is fully metadata driven with focus on horizontal scalability and management simplicity
      o To support speed of development and performance, variable-driven SQL templates used throughout
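Two ideas from this slide, planning loads in hub, link, satellite order and filling variable-driven SQL templates, can be sketched in a few lines of Python. This is an illustration under assumptions rather than the presenters' loader: the template text, the variable names, and the functions `plan_loads` and `render_hub_load` are hypothetical, and Python's standard `string.Template` stands in for whatever replacement mechanism was actually used.

```python
from string import Template

# Hubs must exist before links reference them, and links before satellites.
LOAD_ORDER = {"hub": 0, "link": 1, "sat": 2}


def plan_loads(loads):
    """Order pending loads by type (hub, link, sat), then by name."""
    return sorted(loads, key=lambda load: (LOAD_ORDER[load["type"]], load["name"]))


# Hypothetical hub-load template; every $name is supplied from metadata.
HUB_TEMPLATE = Template(
    "INSERT INTO $target ($hash_col, $bk_col, loadTs, sourceFile)\n"
    "SELECT MD5(s.$bk_col), s.$bk_col, MIN(s.loadTs), MIN(s.sourceFile)\n"
    "FROM $source s\n"
    "WHERE s.loadTs > '$last_epoch'\n"
    "  AND NOT EXISTS (SELECT 1 FROM $target t WHERE t.$bk_col = s.$bk_col)\n"
    "GROUP BY s.$bk_col"
)


def render_hub_load(variables: dict) -> str:
    """Fill the template; substitute() raises KeyError if a variable is missing."""
    return HUB_TEMPLATE.substitute(variables)
```

Using `substitute` rather than `safe_substitute` means a load with incomplete metadata fails before any SQL reaches the database, which suits a loader that records error messages per load.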
  • 15. Design goals for mart loading automation
      Requirement | Solution | Benefit
      Simple, standardized models | Metadata-driven Pentaho PDI | Easy development using parameters and variables
      Easily extensible | Plugin framework | Rapid integration of new functionality
      Rapid new job development | Recycle standardized jobs and transformations | Limited moving parts, easy modification
      Low administration overhead | Leverage built-in logging and tracking | Easily integrated mart loading reporting with other ETL reports
  • 16. Data Information mart automation flow
      Retrieve commands
      • Each dimension and fact is processed independently
      Get dependencies
      • Based on defined transformation, get all related vault tables: links, satellites or hubs
      Retrieve changed data
      • From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
      • Store the data in the database until further processing
      Execute transformations
      • Multiple Pentaho transformations can be processed per command using the data captured in previous steps
      Maintenance
      • Logging happens throughout the whole process
      • Cleanup after all commands have been processed
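The "retrieve changed data" step amounts to collecting the unique keys that changed in any related vault table since the mart object was last refreshed. A minimal in-memory Python sketch follows. The row shape (a `key` and a `loadTs` per row) and the name `changed_keys` are assumptions for illustration; in the flow described above this work is done in the database and in Pentaho transformations.

```python
from datetime import datetime


def changed_keys(related_tables: dict, last_update: datetime) -> set:
    """Collect unique keys loaded after the last refresh of a fact or dimension.

    related_tables maps a vault table name to its rows; each row is a dict
    with a 'key' and a 'loadTs' entry (an assumed shape for this sketch).
    """
    keys = set()
    for rows in related_tables.values():
        for row in rows:
            if row["loadTs"] > last_update:
                keys.add(row["key"])  # the set removes duplicates across tables
    return keys
```

Only the rows behind these keys need to be reprocessed, so the cost of a refresh follows the volume of change rather than the size of the vault.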
  • 17. Primary uses of Bijenkorf DWH
      Customer Analysis
      • Provided first unified data model of customer activity
      • 80% reduction in unique customer keys
      • Allowed for segmentation of customers based on combination of in-store and online activity
      Personalization
      • DV drives recommendation engine and customer recommendations (updated nightly)
      • Data pipeline supports near real time updating of customer recommendations based on web activity
      Business Intelligence
      • DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
      • IT-driven reporting being replaced with self-service BI
  • 18. Biggest drivers of success
      AWS Infrastructure
      • Cost: Entire infrastructure for less than one server in the data center
      • Toolset: Most services available off the shelf, minimizing administration
      • Freedom: No dependency on IT for development support
      • Scalability: Systems automatically scaled to match DWH demands
      Automation
      • Speed: Enormous time savings after initial investment
      • Simplicity: Able to run and monitor 40k+ queries per day with minimal effort
      • Auditability: Enforced tracking and archiving without developer involvement
      PDI framework
      • Ease of use: Adding new commands takes at most 45 minutes
      • Agile: Building the framework took 1 day
      • Low profile: Average memory usage of 250MB
  • 19. Biggest mistakes along the way
      Reliance on documentation and requirements over expert users
      • Initial integration design was based on provided documentation/models, which were rarely accurate
      • Current users of sources should have been engaged earlier to explain undocumented caveats
      Late utilization of templates and variables
      • Variables were utilized late in development, slowing progress significantly and creating consistency issues
      • Good initial design of templates will significantly reduce development time in mid/long run
      Aggressive overextension of resources
      • We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
      • We have shifted focus to continuous release of new information rather than waiting for completeness
  • 20. Primary takeaways
      ◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cautious with design automation!
      ◦ Automation can enormously simplify/accelerate data warehousing. Don’t be afraid to roll your own
      ◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
      ◦ Cloud based architecture based on column store DBs is extremely scalable, cheap, and highly performant
      ◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!

Editor's Notes

  1. One of the focus points will be the return satellite. Maybe the whole connection to the return location and customer should have been modeled as a link? The return satellite is an active satellite.