Data Warehousing
Lecture-18
ETL Detail: Data Extraction & Transformation
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com
Extracting Changed Data
• Incremental data extraction: extract only what has changed, say during the last 24 hours if extraction runs nightly.
• Efficient when changes can be identified: this works well when the small volume of changed data can be identified cheaply.
• Identification could be costly: unfortunately, for many source systems, identifying the recently modified data may be difficult or may affect the operation of the source system.
• Very challenging: Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.
Source Systems
Two CDC sources
• Modern systems
• Legacy systems
CDC in Modern Systems
• Time Stamps
  • Works if a timestamp column is present
  • If the column is not present, add one
  • It may not be possible to modify the table, so add triggers
• Triggers
  • Create a trigger for each source table
  • Following each DML operation, the trigger performs updates
  • Record DML operations in a log
• Partitioning
  • Table is range partitioned, say along a date key
  • Easy to identify new data, say last week’s data
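A minimal sketch of the first two techniques above, using Python's built-in sqlite3 module. The table customer, its last_modified column, the customer_change_log table and the trigger names are all hypothetical, chosen only for illustration; a real source system would have its own schema and trigger syntax.

```python
import sqlite3

# In-memory database standing in for a source system (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    last_modified TEXT NOT NULL      -- hypothetical timestamp column
);
CREATE TABLE customer_change_log (   -- hypothetical log filled by triggers
    log_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER NOT NULL,
    operation   TEXT NOT NULL
);
-- One trigger per DML operation on the source table.
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO customer_change_log (customer_id, operation)
    VALUES (NEW.id, 'INSERT');
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_change_log (customer_id, operation)
    VALUES (NEW.id, 'UPDATE');
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_change_log (customer_id, operation)
    VALUES (OLD.id, 'DELETE');
END;
""")

# Some source activity.
conn.execute("INSERT INTO customer VALUES (1, 'Ali', '2005-11-13 09:00:00')")
conn.execute("INSERT INTO customer VALUES (2, 'Sara', '2005-11-14 10:30:00')")
conn.execute(
    "UPDATE customer SET name = 'Ali Khan', "
    "last_modified = '2005-11-14 23:15:00' WHERE id = 1"
)
conn.execute("DELETE FROM customer WHERE id = 2")


def changed_since(connection, cutoff):
    """Timestamp-based CDC: rows modified after the cutoff."""
    cursor = connection.execute(
        "SELECT id, name FROM customer WHERE last_modified > ? ORDER BY id",
        (cutoff,),
    )
    return cursor.fetchall()


def logged_changes(connection):
    """Trigger-based CDC: every DML operation, in the order it happened."""
    cursor = connection.execute(
        "SELECT customer_id, operation FROM customer_change_log ORDER BY log_id"
    )
    return cursor.fetchall()


changed = changed_since(conn, "2005-11-14 00:00:00")
log = logged_changes(conn)
print(changed)
print(log)
```

In this sketch the timestamp query sees only rows that still exist, so the deleted row does not appear in it, while the trigger log records the delete as well.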
CDC in Legacy Systems
• Changes recorded in tapes: changes that occur in legacy transaction processing are recorded on the log or journal tapes.
• Changes read and removed from tapes: the log or journal tapes are read, and the update/transaction changes are stripped off for movement into the data warehouse.
• Problems with reading a log/journal tape are many:
  • Contains a lot of extraneous data
  • Format is often arcane
  • Often contains addresses instead of data values and keys
  • Sequencing of data in the log tape often has deep and complex implications
  • Log tape varies widely from one DBMS to another
CDC Advantages: Modern Systems
Advantages
1. Immediate
2. No loss of history
3. Flat files NOT required
CDC Advantages: Legacy Systems
Advantages
1. No incremental on-line I/O required for log tape
2. The log tape captures all update processing
3. Log tape processing can be taken off-line
4. No haste to make waste
Major Transformation Types
• Format revision
• Decoding of fields
• Calculated and derived values
• Splitting of single fields
• Merging of information
• Character set conversion
• Unit of measurement conversion
• Date/Time conversion
• Summarization
• Key restructuring
• Duplication
Major Transformation Types
• Format revision
• Decoding of fields
• Calculated and derived values
• Splitting of single fields
Covered in issues
Covered in De-Norm
Major Transformation Types
• Merging of information: this does not simply mean combining columns to create one column. It means taking information about, say, a product that comes from different sources and merging it into a single entity.
• Character set conversion: for PC architectures, converting legacy EBCDIC to ASCII.
• Unit of measurement conversion: for companies with global branches, km vs. mile or lb vs. kg.
• Date/Time conversion: November 14, 2005 is written as 11/14/2005 in the US format and 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV 2005.
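The three conversions on this slide can be sketched in a few lines of Python. The function names are hypothetical, and cp037 is only one of several EBCDIC code pages; which one applies depends on the legacy system, so treat it as an assumption.

```python
from datetime import datetime

# Exact conversion factors by definition of the international mile and pound.
KM_PER_MILE = 1.609344
KG_PER_LB = 0.45359237

# Fixed English month abbreviations, so the output does not depend on locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def ebcdic_to_text(raw: bytes, code_page: str = "cp037") -> str:
    """Character set conversion: decode EBCDIC bytes (cp037 is an assumption)."""
    return raw.decode(code_page)


def miles_to_km(miles: float) -> float:
    """Unit of measurement conversion: miles to kilometres."""
    return miles * KM_PER_MILE


def lb_to_kg(pounds: float) -> float:
    """Unit of measurement conversion: pounds to kilograms."""
    return pounds * KG_PER_LB


def standardize_date(text: str, source_format: str) -> str:
    """Date conversion: parse with the source's format, emit e.g. 14 NOV 2005."""
    parsed = datetime.strptime(text, source_format)
    return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"


us_date = standardize_date("11/14/2005", "%m/%d/%Y")       # US format
british_date = standardize_date("14/11/2005", "%d/%m/%Y")  # British format
greeting = ebcdic_to_text(b"\xc8\xc5\xd3\xd3\xd6")         # EBCDIC bytes for HELLO
print(us_date, british_date, greeting, miles_to_km(10))
```

Note that the source format has to be known for each feed; 11/14/2005 cannot be parsed as day/month, but a date such as 05/06/2005 is ambiguous without that knowledge.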
Major Transformation Types
• Aggregation & Summarization
  • How are they different? Summarization is adding like values. Summarization with calculation across business dimensions is aggregation. Example: Monthly compensation = monthly sale + bonus.
  • Why are both required?
    • Grain mismatch (detail not required, or no space for it)
    • Data marts requiring low detail
    • Detail losing its utility
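The distinction can be illustrated with a small sketch. The sample figures and the function names are made up for illustration; the compensation formula is the one given on the slide.

```python
from collections import defaultdict

# Hypothetical daily sales at the source grain: (ISO date, amount).
daily_sales = [
    ("2005-11-01", 100.0),
    ("2005-11-02", 150.0),
    ("2005-12-01", 200.0),
]

# Hypothetical bonus figures per month.
monthly_bonus = {"2005-11": 50.0}


def summarize_by_month(rows):
    """Summarization: add like values (daily amounts) up to a coarser grain."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[day[:7]] += amount  # "YYYY-MM" prefix identifies the month
    return dict(totals)


def monthly_compensation(monthly_sale, bonus):
    """Aggregation as on the slide: monthly compensation = monthly sale + bonus."""
    return {
        month: sale + bonus.get(month, 0.0)
        for month, sale in monthly_sale.items()
    }


monthly_sale = summarize_by_month(daily_sales)
compensation = monthly_compensation(monthly_sale, monthly_bonus)
print(monthly_sale)
print(compensation)
```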
Major Transformation Types
• Key restructuring (inherent meaning at source)
  • e.g. 92424979234 changed to 12345678
  • The source key has meaning built into it:
      92            42         4979       234
      Country_Code  City_Code  Post_Code  Product_Code
• Removing duplication
  • Incorrect or missing value
  • Inconsistent naming convention, ONE vs 1
  • Incomplete information
  • Physically moved, but address not changed
  • Misspelling or falsification of names
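A minimal sketch of key restructuring, assuming the source key always has the fixed 2-2-4-3 layout shown in the example. The class and function names are hypothetical, and starting the counter at 12345678 only echoes the slide; any meaningless sequence would do.

```python
from typing import NamedTuple


class SourceKey(NamedTuple):
    """Parts of the source key from the slide's example layout."""
    country_code: str
    city_code: str
    post_code: str
    product_code: str


def parse_source_key(key: str) -> SourceKey:
    """Split an 11-digit source key into its embedded parts (2-2-4-3 digits)."""
    if len(key) != 11 or not key.isdigit():
        raise ValueError(f"expected 11 digits, got {key!r}")
    return SourceKey(key[0:2], key[2:4], key[4:8], key[8:11])


class SurrogateKeyMap:
    """Assigns a meaningless warehouse key to each distinct source key."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._assigned = {}

    def key_for(self, source_key: str) -> int:
        # Reuse the key if this source key was seen before.
        if source_key not in self._assigned:
            self._assigned[source_key] = self._next
            self._next += 1
        return self._assigned[source_key]


parts = parse_source_key("92424979234")
keys = SurrogateKeyMap(start=12345678)
first = keys.key_for("92424979234")
again = keys.key_for("92424979234")
other = keys.key_for("92424979235")
print(parts, first, again, other)
```

The parsed parts can still be loaded as ordinary attributes, so the meaning is kept without living inside the key.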
Data content defects
• Domain value redundancy
• Non-standard data formats
• Non-atomic data values
• Multipurpose data fields
• Embedded meanings
• Inconsistent data values
• Data quality contamination
Data content defects Examples
• Domain value redundancy
  • Unit of Measure
    • Dozen, Doz., Dz., 12
• Non-standard data formats
  • Phone Numbers
    • 1234567 or 123.456.7
• Non-atomic data fields
  • Name & Addresses
    • Dr. Hameed Khan, PhD
Data content defects Examples
• Embedded Meanings
  • RC, AP, RJ: received, approved, rejected
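The defect examples above can each be handled by a small cleansing rule. The sketch below is illustrative only: the lookup tables, the list of titles and the function names are assumptions, and real data would need far more complete rules.

```python
import re

# Domain value redundancy: many spellings for one unit of measure.
# "12" is treated as a dozen only because it appears in the unit field.
UNIT_SYNONYMS = {"dozen": "DOZEN", "doz.": "DOZEN", "dz.": "DOZEN", "12": "DOZEN"}

# Embedded meanings: status codes and what they stand for.
STATUS_CODES = {"RC": "received", "AP": "approved", "RJ": "rejected"}

# Hypothetical list of titles used to split non-atomic name fields.
TITLES = {"Dr.", "Mr.", "Ms."}


def standardize_unit(raw: str) -> str:
    """Map redundant unit spellings to one standard value."""
    cleaned = raw.strip()
    return UNIT_SYNONYMS.get(cleaned.lower(), cleaned)


def digits_only(phone: str) -> str:
    """Non-standard formats: keep only the digits of a phone number."""
    return re.sub(r"\D", "", phone)


def split_name(raw: str) -> dict:
    """Non-atomic field: split 'Title Name, Suffix' into separate parts."""
    name_part, _, suffix = raw.partition(",")
    tokens = name_part.split()
    title = tokens[0] if tokens and tokens[0] in TITLES else ""
    if title:
        tokens = tokens[1:]
    return {"title": title, "name": " ".join(tokens), "suffix": suffix.strip()}


def decode_status(code: str) -> str:
    """Embedded meaning: replace a code with the value it stands for."""
    return STATUS_CODES.get(code.strip().upper(), "unknown")


units = [standardize_unit(u) for u in ("Dozen", "Doz.", "Dz.", "12")]
phones = [digits_only(p) for p in ("1234567", "123.456.7")]
person = split_name("Dr. Hameed Khan, PhD")
statuses = [decode_status(c) for c in ("RC", "AP", "RJ")]
print(units, phones, person, statuses)
```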

VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 

Lecture 18

  • 1. Data Warehousing, Lecture-18. ETL Detail: Data Extraction & Transformation. Virtual University of Pakistan. Ahsan Abdullah, Assoc. Prof. & Head, Center for Agro-Informatics Research (www.nu.edu.pk/cairindex.asp), National University of Computers & Emerging Sciences, Islamabad. Email: ahsan1010@yahoo.com
  • 2. ETL Detail: Data Extraction & Transformation
  • 3. Extracting Changed Data. Incremental data extraction: extract only what has changed, say during the last 24 hours for a nightly extraction. Efficient when changes can be identified: this works well when the small volume of changed data can be identified efficiently. Identification could be costly: unfortunately, for many source systems, identifying the recently modified data may be difficult or may affect operation of the source system. Very challenging: Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.
  • 4. Source Systems. Two CDC sources: modern systems and legacy systems.
  • 5. CDC in Modern Systems. Time stamps: works if a timestamp column is present; if the column is not present, add one; it may not be possible to modify the table, so add triggers. Triggers: create a trigger for each source table; following each DML operation the trigger performs updates; record DML operations in a log. Partitioning: the table is range partitioned, say along a date key; it is easy to identify new data, say last week’s data.
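
To make the time stamp and trigger techniques on this slide concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its `last_modified` column, the `orders_change_log` table and the trigger name are all hypothetical names chosen for illustration; a real source system would have its own schema and trigger dialect.

```python
import sqlite3

# Hypothetical source table; the last_modified column is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 100.0, "2005-11-12 09:15:00"),
        (2, 250.0, "2005-11-13 23:40:00"),
        (3, 75.5, "2005-11-14 08:05:00"),
    ],
)


def extract_changed_rows(connection, last_extract_time):
    """Time stamp CDC: return rows modified after the previous extraction."""
    cursor = connection.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY order_id",
        (last_extract_time,),
    )
    return cursor.fetchall()


# Nightly run: pick up everything changed since the previous run.
changed = extract_changed_rows(conn, "2005-11-13 00:00:00")

# Trigger CDC: record each UPDATE on the source table in a change log.
conn.execute("CREATE TABLE orders_change_log (order_id INTEGER, operation TEXT)")
conn.execute(
    """
    CREATE TRIGGER orders_after_update AFTER UPDATE ON orders
    BEGIN
        INSERT INTO orders_change_log (order_id, operation)
        VALUES (NEW.order_id, 'UPDATE');
    END
    """
)
conn.execute("UPDATE orders SET amount = 120.0 WHERE order_id = 1")
logged = conn.execute(
    "SELECT order_id, operation FROM orders_change_log"
).fetchall()

print(changed)
print(logged)
```

The time stamp query only works because ISO-formatted date strings sort in chronological order; the trigger variant needs no extra column on the source table, which matches the slide's point about tables that cannot be modified.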
  • 6. CDC in Legacy Systems. Changes recorded on tapes: changes that occur in legacy transaction processing are recorded on the log or journal tapes. Changes read and removed from tapes: the log or journal tapes are read and the update/transaction changes are stripped off for movement into the data warehouse. Problems with reading a log/journal tape are many: it contains a lot of extraneous data; the format is often arcane; it often contains addresses instead of data values and keys; the sequencing of data in the log tape often has deep and complex implications; the log tape varies widely from one DBMS to another.
  • 7. CDC Advantages: Modern Systems. Advantages: 1. Immediate. 2. No loss of history. 3. Flat files NOT required.
  • 8. CDC Advantages: Legacy Systems. Advantages: 1. No incremental on-line I/O required for log tape. 2. The log tape captures all update processing. 3. Log tape processing can be taken off-line. 4. No haste to make waste.
  • 9. Major Transformation Types: format revision; decoding of fields; calculated and derived values; splitting of single fields; merging of information; character set conversion; unit of measurement conversion; date/time conversion; summarization; key restructuring; duplication.
  • 10. Major Transformation Types: format revision; decoding of fields; calculated and derived values; splitting of single fields. Slide annotations: covered in issues; covered in De-Norm.
  • 11. Major Transformation Types. Merging of information: this does not really mean combining columns to create one column; information for a product coming from different sources is merged into a single entity. Character set conversion: for PC architecture, converting legacy EBCDIC to ASCII. Unit of measurement conversion: for companies with global branches, km vs. mile or lb vs. kg. Date/time conversion: November 14, 2005 is written as 11/14/2005 in the US and 14/11/2005 in the British format; this date may be standardized to be written as 14 NOV 2005.
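
The three conversions on this slide can be sketched in a few lines of Python. The sketch assumes the legacy data uses code page `cp037`, which is only one of several EBCDIC variants; the function names and the month table are illustrative, not part of the lecture.

```python
from datetime import datetime

# Character set conversion: decode EBCDIC bytes, then re-encode as ASCII.
# cp037 is an assumed code page; real legacy systems may use another variant.
ebcdic_bytes = "HELLO".encode("cp037")
ascii_bytes = ebcdic_bytes.decode("cp037").encode("ascii")

# Unit of measurement conversion, using the exact international definitions.
KM_PER_MILE = 1.609344
KG_PER_LB = 0.45359237


def miles_to_km(miles):
    return miles * KM_PER_MILE


def lb_to_kg(pounds):
    return pounds * KG_PER_LB


# Date conversion: an explicit month table avoids locale-dependent output.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def standardize_date(text, source_format):
    """Parse a date in the given source format and return e.g. '14 NOV 2005'."""
    parsed = datetime.strptime(text, source_format)
    return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"


us_date = standardize_date("11/14/2005", "%m/%d/%Y")  # US format
uk_date = standardize_date("14/11/2005", "%d/%m/%Y")  # British format
print(ascii_bytes, us_date, uk_date)
```

Both input formats end up as the same standardized string, which is the point of the slide: the source format has to be known per source system, because `11/14/2005` and `14/11/2005` cannot be told apart reliably from the text alone for days below 13.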
  • 12. Major Transformation Types. Aggregation & Summarization: how are they different, and why are both required? Grain mismatch (don’t require, don’t have space); data marts requiring low detail; detail losing its utility. Summarization is adding like values. Summarization with calculation across a business dimension is aggregation. Example: monthly compensation = monthly sale + bonus.
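
The distinction on this slide can be illustrated with a small Python sketch: the first function adds like values (daily sales rolled up to a month), the second applies the slide's formula, monthly compensation = monthly sale + bonus. The sales rows, names and bonus figures are made up for illustration.

```python
from collections import defaultdict

# Hypothetical daily sales facts: (date, salesperson, amount).
daily_sales = [
    ("2005-11-01", "Ali", 100.0),
    ("2005-11-15", "Ali", 150.0),
    ("2005-11-20", "Sara", 200.0),
    ("2005-12-02", "Ali", 80.0),
]

# Hypothetical bonuses keyed by (month, salesperson).
bonuses = {("2005-11", "Ali"): 50.0}


def summarize_monthly(rows):
    """Summarization: add like values (sales) up to a coarser grain (month)."""
    totals = defaultdict(float)
    for date, person, amount in rows:
        month = date[:7]  # 'YYYY-MM' prefix of an ISO date
        totals[(month, person)] += amount
    return dict(totals)


def monthly_compensation(monthly_sales, monthly_bonuses):
    """Aggregation per the slide: monthly compensation = monthly sale + bonus."""
    return {
        key: sale + monthly_bonuses.get(key, 0.0)
        for key, sale in monthly_sales.items()
    }


monthly_sales = summarize_monthly(daily_sales)
compensation = monthly_compensation(monthly_sales, bonuses)
print(monthly_sales)
print(compensation)
```

Once only the monthly figures are loaded, the daily detail cannot be recovered, which is why the choice of grain has to be made before the transformation is designed.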
  • 13. Major Transformation Types. Key restructuring (inherent meaning at source): i.e. 92424979234 changed to 12345678, where the source key breaks down as 92 (Country_Code), 42 (City_Code), 4979 (Post_Code), 234 (Product_Code). Removing duplication: incorrect or missing value; inconsistent naming convention (ONE vs 1); incomplete information; physically moved, but address not changed; misspelling or falsification of names.
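
Key restructuring as described on this slide might look like the following Python sketch. The field widths come from the slide's breakdown of 92424979234; the `SurrogateKeyMap` class and the choice to start numbering at 12345678 are assumptions made only to mirror the slide's example.

```python
from itertools import count


def split_source_key(key):
    """Break the source key into the parts that carry inherent meaning."""
    return {
        "country_code": key[0:2],
        "city_code": key[2:4],
        "post_code": key[4:8],
        "product_code": key[8:11],
    }


class SurrogateKeyMap:
    """Hypothetical helper: assign meaning-free warehouse keys to source keys."""

    def __init__(self, start=1):
        self._next_key = count(start)
        self._assigned = {}

    def get(self, source_key):
        # Reuse the existing surrogate so the mapping stays stable across loads.
        if source_key not in self._assigned:
            self._assigned[source_key] = next(self._next_key)
        return self._assigned[source_key]


parts = split_source_key("92424979234")
keys = SurrogateKeyMap(start=12345678)
first = keys.get("92424979234")
second = keys.get("92424979999")
repeat = keys.get("92424979234")
print(parts, first, second, repeat)
```

The decomposed parts can still be stored as ordinary attributes; the surrogate key itself carries no meaning, so a change in, say, the city code at the source does not force a change of key in the warehouse.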
  • 14. Data content defects: domain value redundancy; non-standard data formats; non-atomic data values; multipurpose data fields; embedded meanings; inconsistent data values; data quality contamination.
  • 15. Data content defects: Examples. Domain value redundancy: unit of measure (Dozen, Doz., Dz., 12). Non-standard data formats: phone numbers (1234567 or 123.456.7). Non-atomic data fields: name & addresses (Dr. Hameed Khan, PhD).
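
A rough Python sketch of how these three defects could be cleaned during transformation is given below. The synonym table, the target value `DOZEN`, and the naive name-splitting rule are illustrative assumptions; real name and address parsing is considerably harder than this.

```python
import re

# Domain value redundancy: map every spelling to one standard value.
UNIT_SYNONYMS = {"dozen": "DOZEN", "doz.": "DOZEN", "dz.": "DOZEN", "12": "DOZEN"}


def normalize_unit(raw):
    cleaned = raw.strip()
    return UNIT_SYNONYMS.get(cleaned.lower(), cleaned)


def normalize_phone(raw):
    """Non-standard formats: keep digits only, dropping dots and spaces."""
    return re.sub(r"\D", "", raw)


def split_name(raw):
    """Non-atomic field: split 'Dr. Hameed Khan, PhD' into separate parts.

    Naive rule (an assumption): a leading token ending in '.' is a title,
    and anything after the first comma is a suffix.
    """
    name_part, _, suffix = raw.partition(",")
    tokens = name_part.split()
    title = ""
    if tokens and tokens[0].endswith("."):
        title, tokens = tokens[0], tokens[1:]
    return {
        "title": title,
        "first_name": tokens[0] if tokens else "",
        "last_name": " ".join(tokens[1:]),
        "suffix": suffix.strip(),
    }


units = [normalize_unit(u) for u in ["Dozen", "Doz.", "Dz.", "12"]]
phones = [normalize_phone(p) for p in ["1234567", "123.456.7"]]
name = split_name("Dr. Hameed Khan, PhD")
print(units, phones, name)
```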
  • 16. Data content defects: Examples. Embedded meanings: RC, AP, RJ for received, approved, rejected.
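
Decoding such embedded codes is usually a simple lookup. The sketch below uses the three codes from the slide; the function name and the `"unknown"` fallback for unrecognized codes are assumptions added for illustration.

```python
# Code table taken from the slide: RC, AP, RJ.
STATUS_CODES = {"RC": "received", "AP": "approved", "RJ": "rejected"}


def decode_status(code):
    """Replace a cryptic source code with its readable meaning."""
    # Unrecognized codes are flagged rather than silently passed through.
    return STATUS_CODES.get(code.strip().upper(), "unknown")


decoded = [decode_status(c) for c in ["RC", "ap", " RJ ", "XX"]]
print(decoded)
```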