DWH Data Integration
Christian Stade-Schuldt
Project-A Ventures
BI Team Knowledge Transfer
Outline
Motivation
Import
Data Quality
Performance
Monitoring
Project-A, DWH Data Integration, 2014 2
What is data integration?
combination of technical and business processes used
to combine data from disparate sources into meaningful
and valuable information
encompasses discovery, cleansing, monitoring,
transforming and delivery of data from a variety
of sources
by far the largest portion of building a data warehouse
The ETL Process
Extract data from homogeneous or heterogeneous data sources
Transform the data into the proper format or structure for
querying and analysis
Load it into the final target
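The three steps above can be sketched in a few lines. This is a minimal toy pipeline, not the deck's actual tooling; the table and column names (`fact_order`, `amount_eur`) are hypothetical, and SQLite stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would be a CSV file from a shop system.
raw = io.StringIO("order_id,amount_eur\n1,19.90\n2,5.00\n")

# Extract
rows = list(csv.DictReader(raw))

# Transform: cast types, convert euros to integer cents for exact arithmetic
records = [(int(r["order_id"]), round(float(r["amount_eur"]) * 100)) for r in rows]

# Load into the final target table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, amount_cents INTEGER)")
con.executemany("INSERT INTO fact_order VALUES (?, ?)", records)
con.commit()
```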
Processes and Jobs
Process → Set of jobs in a
particular order
Separate processes keep concerns apart
and can run at different time
intervals
File-dependency management
Visualize graph
Processes and Jobs
Job → Set of commands,
may depend on other jobs
Command → Specific action
(e.g. run an SQL file)
⇒ developer friendly (plain
text files)
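A process as "a set of jobs in a particular order with dependencies" is exactly a topological sort over a job graph. A minimal sketch with the standard library (Python 3.9+); the job names are hypothetical, and this is not the deck's actual scheduler:

```python
from graphlib import TopologicalSorter

# Hypothetical process definition: job -> set of jobs it depends on
process = {
    "load_orders": set(),
    "load_customers": set(),
    "transform_orders": {"load_orders", "load_customers"},
    "constrain": {"transform_orders"},
}

# static_order() yields each job only after all of its dependencies
order = list(TopologicalSorter(process).static_order())
```

For parallel execution, the same class offers `prepare()`/`get_ready()`/`done()`, so independent jobs (here the two loads) can run concurrently.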
Sources
Comma-separated files
JSON files
various databases (MySQL,
PostgreSQL, Microsoft SQL
Server)
via project code
external APIs (usually exported to
CSV via cron job)
The Schema Life-Cycle
Data warehouse can be rebuilt from scratch with every import
Import runs in a separate "next" schema
Switch schemata in the last step
Failure does not impact current data warehouse
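The swap in the last step can be illustrated as follows. In PostgreSQL this would be done by renaming schemata (`ALTER SCHEMA ... RENAME TO ...`); SQLite has no schemata, so table renames stand in here, and the table names are hypothetical. The point is the same: the rebuild happens next to the live data, and only a final atomic swap makes it visible.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Current data warehouse, already serving queries
con.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT)")
con.execute("INSERT INTO dim_customer VALUES (1, 'old')")

# The import rebuilds everything in a "next" staging area
con.execute("CREATE TABLE dim_next_customer (id INTEGER, name TEXT)")
con.execute("INSERT INTO dim_next_customer VALUES (1, 'new'), (2, 'also new')")

# Last step: swap inside one transaction, so a failed import
# never touches the live dim_customer table
with con:
    con.execute("ALTER TABLE dim_customer RENAME TO dim_old_customer")
    con.execute("ALTER TABLE dim_next_customer RENAME TO dim_customer")
    con.execute("DROP TABLE dim_old_customer")
```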
Data Quality
Real-world data is dirty
Data quality is critical to data warehouse and business intelligence
solutions
Goal:
single point of truth
cleaned-up and validated data
easily accessible for users
Data Quality 2
Referential integrity → requires every value of
one attribute (column) of a relation (table)
to exist as a value of another attribute in a different
(or the same) relation (table)
Check constraints (ADD CHECK)
Unique constraints
Consistency checks → what goes in has to come out;
no record left behind
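All three constraint types can be demonstrated in one small schema. This is a generic illustration with hypothetical table names (`dim_country`, `fact_transfer`), using SQLite (where referential integrity must be enabled explicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

con.execute("""
    CREATE TABLE dim_country (
        country_id INTEGER PRIMARY KEY,
        iso_code   TEXT NOT NULL UNIQUE,   -- unique constraint
        CHECK (length(iso_code) = 2)       -- check constraint
    )""")
con.execute("""
    CREATE TABLE fact_transfer (
        transfer_id       INTEGER PRIMARY KEY,
        sender_country_id INTEGER NOT NULL
            REFERENCES dim_country(country_id)  -- referential integrity
    )""")

con.execute("INSERT INTO dim_country VALUES (1, 'DE')")
con.execute("INSERT INTO fact_transfer VALUES (1, 1)")  # valid row

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        con.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
```

A bad country code, a duplicate code, and a dangling foreign key are now all rejected by the database instead of silently polluting the warehouse.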
Improving performance
Cost-based scheduling for jobs
(Priority Queue)
Incremental loads
Parallel jobs
Compute keys (e.g. date,
corridor_id →
1000*sender_country_id +
receiver_country_id)
Index relevant columns
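The computed key from the slide packs a country pair into a single integer, which joins and indexes faster than a two-column key. A direct sketch (the assumption, implied by the factor 1000, is that country ids stay below 1000):

```python
def corridor_id(sender_country_id: int, receiver_country_id: int) -> int:
    """Pack a (sender, receiver) country pair into one integer key:
    1000 * sender + receiver. Assumes ids < 1000."""
    assert 0 <= receiver_country_id < 1000
    return 1000 * sender_country_id + receiver_country_id
```

The key is also reversible: integer division by 1000 recovers the sender, the remainder the receiver.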
Monitoring
Runtime stats: how long does
each job/process run?
Timeline graph: how parallel is a
process?
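Collecting runtime stats needs little more than a timer around each job. A minimal sketch (the `runtime_stats` dict and job name are hypothetical; a real setup would write durations to a stats table to feed the runtime and timeline graphs):

```python
import time
from contextlib import contextmanager

runtime_stats = {}  # job name -> duration in seconds

@contextmanager
def timed(job_name):
    """Record how long a job runs, even if it fails."""
    start = time.perf_counter()
    try:
        yield
    finally:
        runtime_stats[job_name] = time.perf_counter() - start

with timed("load_orders"):
    time.sleep(0.01)  # stand-in for the actual job
```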
Monitoring 2
DB schema: Visualize Schema
Relation sizes: Visualize growth
over time
Monitoring 3
Index usage: are indexes used or
unnecessary?
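In PostgreSQL this question is typically answered from the statistics views (e.g. `pg_stat_user_indexes`, whose `idx_scan` counter shows whether an index is ever touched). As a self-contained illustration, SQLite's `EXPLAIN QUERY PLAN` shows whether the planner would use an index for a given query; table and index names here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_order (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE INDEX idx_fact_order_customer ON fact_order(customer_id)")

# Ask the planner whether a typical lookup would use the index;
# each plan row's last column is a human-readable detail string
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM fact_order WHERE customer_id = 42"
).fetchall()
uses_index = any("idx_fact_order_customer" in row[-1] for row in plan)
```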
Naming conventions
prefix schemata
(e.g. os_, om_)
schema names
(e.g. dim_next, dim, tmp, data)
Naming conventions 2
Jobs follow a pattern:
load: loads data into the data schema
transform: transforms data into the dim schema
copy: copies data into the dim schema (no transformation)
flatten: creates flattened tables for faster access
constrain: applies foreign key constraints
Summary
Data integration is the largest portion of building a data warehouse
Ensure data quality by applying constraints and tests
Monitor your data integration process
For Further Reading I
Ralph Kimball and Margy Ross
The Data Warehouse Toolkit, 3rd Edition.
Wiley, 2013.