DWH Data Integration
Christian Stade-Schuldt
Project-A Ventures
BI Team Knowledge Transfer
Outline
Motivation
Import
Data Quality
Performance
Monitoring
Project-A, DWH Data Integration, 2014 2
What is data integration?
A combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information
Encompasses discovery, cleansing, monitoring, transforming and delivery of data from a variety of sources
By far the largest portion of building a data warehouse
The ETL Process
Extract data from homogeneous or heterogeneous data sources
Transform the data so it is stored in the proper format or structure for querying and analysis
Load it into the final target
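The three steps can be sketched in a few lines. This is a minimal illustration, not the team's actual tooling: the CSV content, the `orders` table and the in-memory SQLite target are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real import would read files or query databases.
RAW_CSV = """order_id,amount,country
1,10.50,de
2,,fr
3,7.25,DE
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, normalise types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
```

Keeping the three steps as separate functions makes each one testable on its own.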
Processes and Jobs
Process → a set of jobs in a particular order
Different processes for separation; each can run at different time intervals
File-dependency management
Visualize the graph
Processes and Jobs
Job → a set of commands; can depend on other jobs
Command → a specific action (e.g. run an SQL file)
⇒ developer-friendly (plain text files)
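Job dependencies of this kind form a directed graph, and a valid run order is a topological sort of it. The sketch below uses Python's standard `graphlib` module; the job names are hypothetical and the slides do not say how the real scheduler resolves dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each name maps to the set of jobs it depends on.
JOBS = {
    "load_orders": set(),
    "load_customers": set(),
    "transform_orders": {"load_orders", "load_customers"},
    "flatten_orders": {"transform_orders"},
}

def run_order(jobs):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(jobs).static_order())

order = run_order(JOBS)
```

A cyclic dependency makes `static_order()` raise `graphlib.CycleError`, which is a useful early check on job definitions.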
Sources
Comma-separated files
JSON files
Various databases (MySQL, PostgreSQL, Microsoft SQL Server)
via project code
External APIs (usually exported to CSV via a cron job)
The Schema Life-Cycle
The data warehouse can be rebuilt from scratch with every import
The import runs on a "next" schema
Switch schemata in the last step
Failure does not impact current data warehouse
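One way to implement the final switch, assuming a PostgreSQL warehouse, is to rename schemas inside a single transaction. The slides do not show the actual mechanism, so the `_old` schema and the statement order below are assumptions; the sketch only builds the SQL statements and expects the current schema to exist already.

```python
def schema_switch_statements(schema="dim"):
    """Build the statements that promote `<schema>_next` to `<schema>`.

    Assumes PostgreSQL, where schema renames are transactional, so readers
    see either the old or the new warehouse, never a half-switched one.
    """
    nxt, old = f"{schema}_next", f"{schema}_old"
    return [
        "BEGIN",
        f"DROP SCHEMA IF EXISTS {old} CASCADE",    # discard the previous backup
        f"ALTER SCHEMA {schema} RENAME TO {old}",  # keep the current data as backup
        f"ALTER SCHEMA {nxt} RENAME TO {schema}",  # promote the freshly built schema
        "COMMIT",
    ]

statements = schema_switch_statements("dim")
```

Because the switch only runs after every earlier job has succeeded, a failed import leaves `dim` untouched.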
Data Quality
Real-world data is dirty
Data quality is critical to data warehouse and business intelligence solutions
Goal:
single point of truth
cleaned-up and validated data
easily accessible for users
Data Quality 2
Referential integrity → requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute in a different (or the same) relation (table)
Check constraints (ADD CHECK)
Unique constraints
Consistency checks → What goes in has to come out. No one’s left behind. Some are. :(
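The three constraint types can be tried out with SQLite from Python's standard library. The `country` and `transfer` tables are invented for the example, and SQLite stands in for whatever database the warehouse actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE country (
    country_id INTEGER PRIMARY KEY,
    iso_code   TEXT NOT NULL UNIQUE                                -- unique constraint
);
CREATE TABLE transfer (
    transfer_id INTEGER PRIMARY KEY,
    country_id  INTEGER NOT NULL REFERENCES country (country_id),  -- referential integrity
    amount      REAL NOT NULL CHECK (amount > 0)                   -- check constraint
);
INSERT INTO country VALUES (1, 'DE');
""")

def try_insert(sql):
    """Run an insert and report whether the constraints accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

ok = try_insert("INSERT INTO transfer VALUES (1, 1, 25.0)")          # valid row
bad_fk = try_insert("INSERT INTO transfer VALUES (2, 99, 25.0)")     # unknown country
bad_check = try_insert("INSERT INTO transfer VALUES (3, 1, -5.0)")   # negative amount
bad_unique = try_insert("INSERT INTO country VALUES (2, 'DE')")      # duplicate ISO code
```

Only the first insert succeeds; the other three are rejected by the database itself rather than by application code.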
Improving performance
Cost-based scheduling for jobs (priority queue)
Incremental loads
Parallel jobs
Compute keys (e.g. date, corridor_id → (1000 * sender_country_id + receiver_country_id))
Index relevant columns
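Two of these points lend themselves to a short sketch: the computed `corridor_id` key and cost-based ordering with a priority queue. The key formula is taken from the slide. Starting the most expensive job first is an assumption, since the slides do not say how cost feeds into the schedule, and the job names and costs are hypothetical.

```python
import heapq

def corridor_id(sender_country_id, receiver_country_id):
    """Combine two country ids into one integer key, as in the formula above.

    Only unambiguous while receiver_country_id stays below 1000.
    """
    return 1000 * sender_country_id + receiver_country_id

def schedule(costs):
    """Return job names ordered so the most expensive job starts first.

    `costs` maps job name -> estimated runtime in seconds.
    heapq is a min-heap, so costs are negated to pop the largest first.
    """
    heap = [(-cost, name) for name, cost in costs.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

key = corridor_id(49, 33)
plan = schedule({"load_orders": 5, "transform_orders": 120, "flatten_orders": 30})
```

A single integer key like this is cheaper to join and index on than a pair of columns, which is presumably why it appears under performance.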
Monitoring
Runtime stats: how long does each job/process run?
Timeline graph: how parallel is a process?
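Collecting runtime stats can be as simple as wrapping each job in a timer. The following is a minimal sketch with a hypothetical job name; it is not the monitoring tool shown in the slides.

```python
import time
from contextlib import contextmanager

runtimes = {}  # job name -> duration in seconds

@contextmanager
def timed(job_name):
    """Record how long the wrapped block runs under `job_name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the job raises, so failed runs show up as well.
        runtimes[job_name] = time.perf_counter() - start

with timed("load_orders"):
    sum(range(100_000))  # stand-in for real work
```

Storing the start time alongside each duration would give the data needed for a timeline graph.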
Monitoring 2
DB schema: visualize the schema
Relation sizes: visualize growth over time
Monitoring 3
Index usage: are indexes used or unnecessary?
Naming conventions
Prefix schemata (e.g. os_, om_)
Schema names (e.g. dim_next, dim, tmp, data)
Naming conventions 2
Jobs follow a pattern:
load → loads data into the data schema
transform → transforms data into the dim schema
copy → copies data into the dim schema (no transformation)
flatten → creates flattened tables for faster access
constrain → applies foreign key constraints
Summary
Data integration is the largest portion of building a data warehouse
Ensure data quality by applying constraints and tests
Monitor your data integration process
For Further Reading
Ralph Kimball. The Data Warehouse Toolkit. Wiley, 2013.
