DWH Data Integration
Christian Stade-Schuldt
Project-A Ventures
BI Team Knowledge Transfer
Outline
Motivation
Import
Data Quality
Performance
Monitoring
Project-A, DWH Data Integration, 2014 2
What is data integration?
combination of technical and business processes used
to combine data from disparate sources into meaningful
and valuable information
encompasses discovery, cleansing, monitoring,
transforming and delivery of data from a variety
of sources
by far the largest portion of building a data warehouse
The ETL Process
Extract data from homogeneous or heterogeneous data sources
Transform the data into the proper format or structure for
querying and analysis
Load it into the final target
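The three steps above can be sketched in a few lines. This is a minimal toy pipeline, not the deck's actual tooling; the table and column names (`fact_order`, `amount_eur`) are hypothetical, and SQLite stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would be a CSV file from a shop system.
raw = io.StringIO("order_id,amount_eur\n1,19.90\n2,5.00\n")

# Extract
rows = list(csv.DictReader(raw))

# Transform: cast types, convert euros to integer cents for exact arithmetic
records = [(int(r["order_id"]), round(float(r["amount_eur"]) * 100)) for r in rows]

# Load into the final target table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, amount_cents INTEGER)")
con.executemany("INSERT INTO fact_order VALUES (?, ?)", records)
con.commit()
```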
Processes and Jobs
Process → Set of jobs in a
particular order
Separate processes keep concerns apart
and can run at different time
intervals
File-dependency management
Visualize graph
Processes and Jobs
Job → Set of commands,
may depend on other jobs
Command → Specific action
(e.g. run an SQL file)
⇒ developer friendly (plain
text files)
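A process as "a set of jobs in a particular order with dependencies" is exactly a topological sort over a job graph. A minimal sketch with the standard library (Python 3.9+); the job names are hypothetical, and this is not the deck's actual scheduler:

```python
from graphlib import TopologicalSorter

# Hypothetical process definition: job -> set of jobs it depends on
process = {
    "load_orders": set(),
    "load_customers": set(),
    "transform_orders": {"load_orders", "load_customers"},
    "constrain": {"transform_orders"},
}

# static_order() yields each job only after all of its dependencies
order = list(TopologicalSorter(process).static_order())
```

For parallel execution, the same class offers `prepare()`/`get_ready()`/`done()`, so independent jobs (here the two loads) can run concurrently.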
Sources
Comma-separated files
JSON files
various databases (MySQL,
PostgreSQL, Microsoft SQL
Server)
via project code
external APIs (usually exported to
CSV via cron job)
The Schema Life-Cycle
Data warehouse can be rebuilt from scratch with every import
Import runs in a separate "next" schema
Switch schemata in the last step
Failure does not impact current data warehouse
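The swap in the last step can be illustrated as follows. In PostgreSQL this would be done by renaming schemata (`ALTER SCHEMA ... RENAME TO ...`); SQLite has no schemata, so table renames stand in here, and the table names are hypothetical. The point is the same: the rebuild happens next to the live data, and only a final atomic swap makes it visible.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Current data warehouse, already serving queries
con.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT)")
con.execute("INSERT INTO dim_customer VALUES (1, 'old')")

# The import rebuilds everything in a "next" staging area
con.execute("CREATE TABLE dim_next_customer (id INTEGER, name TEXT)")
con.execute("INSERT INTO dim_next_customer VALUES (1, 'new'), (2, 'also new')")

# Last step: swap inside one transaction, so a failed import
# never touches the live dim_customer table
with con:
    con.execute("ALTER TABLE dim_customer RENAME TO dim_old_customer")
    con.execute("ALTER TABLE dim_next_customer RENAME TO dim_customer")
    con.execute("DROP TABLE dim_old_customer")
```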
Data Quality
Real-world data is dirty
Data quality is critical to data warehouse and business intelligence
solutions
Goal:
single point of truth
cleaned-up and validated data
easily accessible for users
Data Quality 2
Referential integrity → requires every value of
one attribute (column) of a relation (table)
to exist as a value of another attribute in a different
(or the same) relation (table)
Check constraints (ADD CHECK)
Unique constraints
Consistency checks → what goes in has to come out;
no record left behind
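All three constraint types can be demonstrated in one small schema. This is a generic illustration with hypothetical table names (`dim_country`, `fact_transfer`), using SQLite (where referential integrity must be enabled explicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

con.execute("""
    CREATE TABLE dim_country (
        country_id INTEGER PRIMARY KEY,
        iso_code   TEXT NOT NULL UNIQUE,   -- unique constraint
        CHECK (length(iso_code) = 2)       -- check constraint
    )""")
con.execute("""
    CREATE TABLE fact_transfer (
        transfer_id       INTEGER PRIMARY KEY,
        sender_country_id INTEGER NOT NULL
            REFERENCES dim_country(country_id)  -- referential integrity
    )""")

con.execute("INSERT INTO dim_country VALUES (1, 'DE')")
con.execute("INSERT INTO fact_transfer VALUES (1, 1)")  # valid row

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        con.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
```

A bad country code, a duplicate code, and a dangling foreign key are now all rejected by the database instead of silently polluting the warehouse.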
Improving performance
Cost-based scheduling for jobs
(Priority Queue)
Incremental loads
Parallel jobs
Compute keys (e.g. date,
corridor_id →
1000*sender_country_id +
receiver_country_id)
Index relevant columns
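The computed key from the slide packs a country pair into a single integer, which joins and indexes faster than a two-column key. A direct sketch (the assumption, implied by the factor 1000, is that country ids stay below 1000):

```python
def corridor_id(sender_country_id: int, receiver_country_id: int) -> int:
    """Pack a (sender, receiver) country pair into one integer key:
    1000 * sender + receiver. Assumes ids < 1000."""
    assert 0 <= receiver_country_id < 1000
    return 1000 * sender_country_id + receiver_country_id
```

The key is also reversible: integer division by 1000 recovers the sender, the remainder the receiver.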
Monitoring
Runtime stats: how long does
each job/process run?
Timeline graph: how parallel is a
process?
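Collecting runtime stats needs little more than a timer around each job. A minimal sketch (the `runtime_stats` dict and job name are hypothetical; a real setup would write durations to a stats table to feed the runtime and timeline graphs):

```python
import time
from contextlib import contextmanager

runtime_stats = {}  # job name -> duration in seconds

@contextmanager
def timed(job_name):
    """Record how long a job runs, even if it fails."""
    start = time.perf_counter()
    try:
        yield
    finally:
        runtime_stats[job_name] = time.perf_counter() - start

with timed("load_orders"):
    time.sleep(0.01)  # stand-in for the actual job
```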
Monitoring 2
DB schema: Visualize Schema
Relation sizes: Visualize growth
over time
Monitoring 3
Index usage: are indexes used or
unnecessary?
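In PostgreSQL this question is typically answered from the statistics views (e.g. `pg_stat_user_indexes`, whose `idx_scan` counter shows whether an index is ever touched). As a self-contained illustration, SQLite's `EXPLAIN QUERY PLAN` shows whether the planner would use an index for a given query; table and index names here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_order (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE INDEX idx_fact_order_customer ON fact_order(customer_id)")

# Ask the planner whether a typical lookup would use the index;
# each plan row's last column is a human-readable detail string
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM fact_order WHERE customer_id = 42"
).fetchall()
uses_index = any("idx_fact_order_customer" in row[-1] for row in plan)
```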
Naming conventions
prefix schemata
(e.g. os_, om_)
schema names
(e.g. dim_next, dim, tmp, data)
Naming conventions 2
Jobs follow a pattern:
load: loads data into the data schema
transform: transforms data into the dim schema
copy: copies data into the dim schema (no transformation)
flatten: creates flattened tables for faster access
constrain: applies foreign key constraints
Summary
Data integration is the largest portion of building a data warehouse
Ensure data quality by applying constraints and tests
Monitor your data integration process
For Further Reading I
Ralph Kimball and Margy Ross
The Data Warehouse Toolkit, 3rd Edition.
Wiley, 2013.