Data Warehousing
Lecture-16
Extract Transform Load (ETL)
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com

Putting the pieces together
[Figure: data warehouse architecture in four tiers, with everything except ETL washed out. Tier 0, data sources: operational databases, semistructured sources, archived data, www data. Tier 1, data warehouse server: data warehouse, data marts, metadata. Tier 2, OLAP servers: MOLAP, ROLAP. Tier 3, clients: query/reporting, analysis and data mining tools for IT users and business users. Extract, Transform, Load (ETL) sits between the data sources and the data warehouse.]

The ETL Cycle
EXTRACT
The process of reading data from different sources.
TRANSFORM
The process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database.
LOAD
The process of writing the data into the target database.
[Figure: the ETL cycle. EXTRACT reads from MIS systems (Acct, HR), legacy systems, other indigenous applications (COBOL, VB, C++, Java), archived data and www data into temporary data storage; TRANSFORM and CLEANSE work on that storage; LOAD writes the result into the data warehouse, which feeds OLAP.]
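
The three stages of the cycle can be sketched as three small functions chained together. This is a minimal illustration, not a production pipeline: the in-memory CSV source, the `sales` table and the function names `extract`, `transform`, `load` and `run_etl` are all assumptions made for the example, and SQLite stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real extract would read from OLTP systems or files.
SOURCE_CSV = """id,customer,amount
1, muhammad ibrahim ,100.50
2,K. S. ABDULLAH,200
"""


def extract(text):
    """Read raw rows from the source exactly as they are stored there."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Bring the rows into one consistent state (types, case, whitespace)."""
    return [
        (int(r["id"]), r["customer"].strip().upper(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, text):
    """Run the three stages in order and return what ended up in the target."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT id, customer, amount FROM sales ORDER BY id"
    ).fetchall()
```

Keeping each stage as its own function mirrors the slide: the stages are independent pieces, but the output of one is the input of the next.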

ETL Processing
Data acquisition time may include:
• Extracts from source systems
• Data movement
• Data cleansing
• Data transformation
• Data loading
• Index maintenance
• Statistics collection
• Backup
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Backup is a major task: it is a DWH, not a cube.

Overview of Data Extraction
Extraction is the first step of ETL and is followed by many others.
Source systems for extraction are typically OLTP systems.
A very complex task, for a number of reasons:
• The source system is often very complex and poorly documented.
• Data has to be extracted not once but a number of times.

The process design depends on:
• Which extraction method to choose?
• How to make the extracted data available for further processing?

Types of Data Extraction
• Logical Extraction
  ◦ Full Extraction
  ◦ Incremental Extraction
• Physical Extraction
  ◦ Online Extraction
  ◦ Offline Extraction
  ◦ Legacy vs. OLTP

Logical Data Extraction
• Full Extraction
  ◦ The data is extracted completely from the source system.
  ◦ No need to keep track of changes.
  ◦ Source data is made available as-is, with no additional information needed.
• Incremental Extraction
  ◦ Only data changed after a well-defined point or event in time is extracted.
  ◦ A mechanism is used to reflect/record the temporal changes in the data (a column or a table).
  ◦ Sometimes entire tables are off-loaded from the source system into the DWH.
  ◦ This can have a significant performance impact on the data warehouse server.
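
The difference between the two logical methods can be sketched as follows. This is only an illustration under stated assumptions: the `orders` table, its `last_modified` change-tracking column and all function names are hypothetical, and an in-memory SQLite database stands in for the OLTP source.

```python
import sqlite3


def make_source():
    """Build a tiny in-memory stand-in for an OLTP source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 10.0, "2024-01-01 09:00:00"),
            (2, 20.0, "2024-01-02 09:00:00"),
            (3, 30.0, "2024-01-03 09:00:00"),
        ],
    )
    return conn


def full_extract(conn):
    """Full extraction: read everything; no change tracking is needed."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders ORDER BY id"
    ).fetchall()


def incremental_extract(conn, since):
    """Incremental extraction: read only rows changed after the last run.

    Relies on a change-tracking column (here the assumed last_modified).
    Returns the rows plus the new high-water mark to store for the next run.
    """
    rows = conn.execute(
        "SELECT id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (since,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else since
    return rows, new_mark
```

The timestamps are stored as ISO-formatted text so that plain string comparison orders them correctly; a source without such a column would need another mechanism, for example a separate change table.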

Physical Data Extraction…
• Online Extraction
  ◦ Data is extracted directly from the source system.
  ◦ The extraction may access source tables through an intermediate system.
  ◦ The intermediate system is usually similar to the source system.
• Offline Extraction
  ◦ Data is NOT extracted directly from the source system; instead it is staged explicitly outside the original source system.
  ◦ The data either already has a structure or was created by an extraction routine.
  ◦ Some of the prevalent structures are:
    ▪ Flat files
    ▪ Dump files
    ▪ Redo and archive logs
    ▪ Transportable table-spaces
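
Offline extraction through a flat file might look like the sketch below. It assumes a CSV file as the flat-file format and uses SQLite as a stand-in source; the `customers` table and the helper names are made up for the example.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def stage_to_flat_file(conn, table, path):
    """Dump one source table to a CSV flat file outside the source system."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    header = [col[0] for col in cursor.description]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cursor)
    return path


def read_flat_file(path):
    """Later ETL steps read the staged file, never the source system itself."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def demo():
    """Stage a small hypothetical table and read it back from the flat file."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    source.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ibrahim"), (2, "Abdullah")]
    )
    with tempfile.TemporaryDirectory() as tmp:
        staged = stage_to_flat_file(source, "customers", Path(tmp) / "customers.csv")
        return read_flat_file(staged)
```

Note that everything read back from a flat file is text, so type conversion becomes part of the transformation step.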

Physical Data Extraction
• Legacy vs. OLTP
  ◦ Data moved from the source system
  ◦ Copy made of the source system data
  ◦ Staging area used for performance reasons

Data Transformation
• Basic tasks
  1. Selection
  2. Splitting/Joining
  3. Conversion
  4. Summarization
  5. Enrichment
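
Three of these tasks can be illustrated on a handful of in-memory records. The interpretations used here are common ones but are assumptions as far as these slides go: selection as choosing the records and fields of interest, splitting as breaking one field into several, and summarization as aggregating detail rows. The record layout and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical extracted records.
RECORDS = [
    {"name": "Ibrahim, Muhammad", "city": "RAWALPINDI", "amount": 100.0, "status": "ok"},
    {"name": "Abdullah, K. S.", "city": "ISLAMABAD", "amount": 50.0, "status": "ok"},
    {"name": "Khan, Ali", "city": "RAWALPINDI", "amount": 25.0, "status": "void"},
]


def select(records):
    """Selection: keep only valid records and only the fields needed downstream."""
    return [
        {"name": r["name"], "city": r["city"], "amount": r["amount"]}
        for r in records
        if r["status"] == "ok"
    ]


def split_name(record):
    """Splitting: break the 'Family, First' name field into two fields."""
    family, first = [part.strip() for part in record["name"].split(",", 1)]
    out = {k: v for k, v in record.items() if k != "name"}
    out.update(family_name=family, first_name=first)
    return out


def summarize(records):
    """Summarization: total amount per city instead of one row per record."""
    totals = defaultdict(float)
    for r in records:
        totals[r["city"]] += r["amount"]
    return dict(totals)
```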

Data Transformation Basic Tasks
• Selection

Data Transformation Basic Tasks
• Splitting/joining

Data Transformation Basic Tasks
• Conversion

Data Transformation Basic Tasks: Conversion
Example-1
• Convert common data elements, e.g. name and address, into a consistent form.
• Translate dissimilar codes into a standard code.

Field format                 Field data
First-Family-title           Muhammad Ibrahim Contractor
Family-title-comma-first     Ibrahim Contractor, Muhammad
Family-comma-first-title     Ibrahim, Muhammad Contractor

Natl. ID     →  NID
National ID  →  NID

F/NO-2, F-2, FL.NO.2, FL.2, FL/NO.2, FL-2, FLAT-2, FLAT#, FLAT,2, FLAT-NO-2, FL-NO.2  →  FLAT No. 2
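
The conversions in this example can be sketched with a lookup table and a regular expression. The pattern, the function names and the choice to return `None` for a variant without a number (such as `FLAT#`) are assumptions of this sketch; the name parser also assumes that each part of the name is a single word, which holds for the slide's example but not in general.

```python
import re

# Matches the flat-number variants from the slide, e.g. 'F/NO-2' or 'FL.2'.
FLAT_PATTERN = re.compile(
    r"^(?:FLAT|FL|F)[\s./,#-]*(?:NO)?[\s./,#-]*(\d+)$", re.IGNORECASE
)

# Translation table for dissimilar codes (entries taken from the slide).
CODE_MAP = {"Natl. ID": "NID", "National ID": "NID"}


def standardize_flat(value):
    """Map a flat-number variant to the form 'FLAT No. <n>'.

    Returns None when no flat number can be found (for example 'FLAT#').
    """
    match = FLAT_PATTERN.match(value.strip())
    return f"FLAT No. {int(match.group(1))}" if match else None


def standardize_code(label):
    """Translate a source label to its standard code; unknown labels pass through."""
    return CODE_MAP.get(label, label)


def parse_name(value, field_format):
    """Convert a name in one of the three slide formats to (first, family, title)."""
    if field_format == "First-Family-title":
        first, family, title = value.split()
    elif field_format == "Family-title-comma-first":
        left, first = [p.strip() for p in value.split(",")]
        family, title = left.split()
    elif field_format == "Family-comma-first-title":
        family, right = [p.strip() for p in value.split(",")]
        first, title = right.split()
    else:
        raise ValueError(f"unknown field format: {field_format}")
    return first, family, title
```

Once every source format is parsed into the same tuple, the warehouse can store the name in whichever single form it standardizes on.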

Data Transformation Basic Tasks: Conversion
Example-2
• Data representation change
  ◦ EBCDIC to ASCII
• Operating system change
  ◦ Mainframe (MVS) to UNIX
  ◦ UNIX to NT or XP
• Data type change
  ◦ Program (Excel to Access), database format (FoxPro to Access).
  ◦ Character, numeric and date type.
  ◦ Fixed and variable length.
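
Two of these conversions are easy to sketch in Python. The code assumes the mainframe data uses EBCDIC code page 037, which Python exposes as the `cp037` codec; other code pages (for example `cp500`) would need a different codec name. The function names and the field widths in the usage note are hypothetical.

```python
def ebcdic_to_ascii(data, codepage="cp037"):
    """Decode EBCDIC bytes and re-encode them as ASCII bytes.

    Characters without an ASCII equivalent are replaced by '?'.
    """
    text = data.decode(codepage)
    return text.encode("ascii", errors="replace")


def fixed_to_variable(record, widths):
    """Split a fixed-length record into stripped, variable-length fields."""
    fields, start = [], 0
    for width in widths:
        fields.append(record[start:start + width].rstrip())
        start += width
    return fields
```

For instance, `fixed_to_variable("IBRAHIM   RAWALPINDI", [10, 10])` splits a 20-character record into a name field and a city field and drops the padding.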

Data Transformation Basic Tasks
• Summarization

Data Transformation Basic Tasks
• Enrichment

Data Transformation Basic Tasks: Enrichment
Example
• Data elements are mapped from source tables and files to destination fact and dimension tables.
• Default values are used in the absence of source data.
• Fields are added for unique keys and time elements.

Input Data
HAJI MUHAMMAD IBRAHIM, GOVT. CONT.
K. S. ABDULLAH & BROTHERS,
MAMOOJI ROAD, ABDULLAH MANZIL
RAWALPINDI, Ph 67855

Parsed Data
First Name: HAJI MUHAMMAD
Family Name: IBRAHIM
Title: GOVT. CONT.
Firm: K. S. ABDULLAH & BROTHERS
Firm Location: ABDULLAH MANZIL
Road: MAMOOJI ROAD
Phone: 051-67855
City: RAWALPINDI
Code: 46200
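
The enrichment step in this example (area code prefixed to the phone number, postal code added, defaults and key fields filled in) can be sketched as a lookup against reference data. The `CITY_REFERENCE` table, the field names, the `UNKNOWN` default and the `enrich` function are assumptions of this sketch; only the Rawalpindi values come from the slide.

```python
from datetime import date

# Reference data assumed for the example; the values come from the parsed record above.
CITY_REFERENCE = {"RAWALPINDI": {"area_code": "051", "postal_code": "46200"}}


def enrich(record, surrogate_key, load_date):
    """Return a copy of a parsed record with defaults, lookups and keys added."""
    out = dict(record)
    # Default value when the source supplies nothing.
    out.setdefault("title", "UNKNOWN")
    ref = CITY_REFERENCE.get(out.get("city", ""), {})
    # Prefix the phone number with the city's area code when one is known.
    phone = out.get("phone")
    if phone and "area_code" in ref and "-" not in phone:
        out["phone"] = f"{ref['area_code']}-{phone}"
    # Add the postal code from the reference data.
    out["code"] = ref.get("postal_code", "UNKNOWN")
    # Added fields: a unique key and a time element.
    out["customer_key"] = surrogate_key
    out["load_date"] = load_date.isoformat()
    return out
```

The function returns a new dictionary rather than modifying its input, so the parsed record remains available for auditing.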

Aspects of Data Loading Strategies
• Need to look at:
  ◦ Data freshness
  ◦ System performance
  ◦ Data volatility
• Data freshness
  ◦ Very fresh data: low update efficiency
  ◦ Historical data: high update efficiency
  ◦ Always trade-offs in the light of goals
• System performance
  ◦ Availability of staging table space
  ◦ Impact on query workload
• Data volatility
  ◦ Ratio of new to historical data
  ◦ High percentages of data change (batch update)

Three Loading Strategies
• Once we have transformed the data, there are three primary loading strategies:
  ◦ Full data refresh with BLOCK INSERT or 'block slamming' into an empty table.
  ◦ Incremental data refresh with BLOCK INSERT or 'block slamming' into existing (populated) tables.
  ◦ Trickle/continuous feed with constant data collection and loading using row-level insert and update operations.
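
The three strategies can be contrasted in a short sketch. SQLite has no BLOCK INSERT, so `executemany` inside a single transaction is used here as a stand-in for a bulk load; the `sales` table and the function names are assumptions made for the example.

```python
import sqlite3


def create_target(conn):
    """Create the hypothetical target table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )


def full_refresh(conn, rows):
    """Full data refresh: empty the table, then bulk-insert everything."""
    with conn:  # one transaction; rolled back if anything fails
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


def incremental_refresh(conn, new_rows):
    """Incremental data refresh: bulk-insert a batch into the populated table."""
    with conn:
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", new_rows)


def trickle_feed(conn, row):
    """Trickle feed: apply one row as an update or, if it is new, as an insert."""
    with conn:
        cur = conn.execute(
            "UPDATE sales SET amount = ? WHERE id = ?", (row[1], row[0])
        )
        if cur.rowcount == 0:
            conn.execute("INSERT INTO sales (id, amount) VALUES (?, ?)", row)
```

The trade-off discussed under data freshness shows up directly: the two bulk paths touch the table once per batch, while the trickle feed pays a transaction per row in exchange for fresher data.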

More Related Content

What's hot

Data Warehouse
Data Warehouse Data Warehouse
Data Warehouse
MadhuriNigam1
 
Online analytical processing (olap) tools
Online analytical processing (olap) toolsOnline analytical processing (olap) tools
Online analytical processing (olap) tools
kulkarnivaibhav
 
Olap, oltp and data mining
Olap, oltp and data miningOlap, oltp and data mining
Olap, oltp and data mining
zafrii
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
OReillyStrata
 
OLAP v/s OLTP
OLAP v/s OLTPOLAP v/s OLTP
OLAP v/s OLTP
ahsan irfan
 
Odam: Open Data, Access and Mining
Odam: Open Data, Access and MiningOdam: Open Data, Access and Mining
Odam: Open Data, Access and Mining
Daniel JACOB
 
A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database
A Common Database Approach for OLTP and OLAP Using an In-Memory Column DatabaseA Common Database Approach for OLTP and OLAP Using an In-Memory Column Database
A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database
Ishara Amarasekera
 
02. Data Warehouse and OLAP
02. Data Warehouse and OLAP02. Data Warehouse and OLAP
02. Data Warehouse and OLAP
Achmad Solichin
 
Olap and metadata
Olap and metadata Olap and metadata
Olap and metadata
Punk Milton
 
Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-Programmers
Edureka!
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
Puneet Kansal
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
Ambuj Kumar
 
Big data
Big dataBig data
Big data
Mohamed Salman
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
DataminingTools Inc
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
Mishika Bharadwaj
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
Revolution Analytics
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Edureka!
 
Why Talend for Big Data?
Why Talend for Big Data?Why Talend for Big Data?
Why Talend for Big Data?
Edureka!
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
Murtaza Doctor
 

What's hot (20)

Data Warehouse
Data Warehouse Data Warehouse
Data Warehouse
 
Online analytical processing (olap) tools
Online analytical processing (olap) toolsOnline analytical processing (olap) tools
Online analytical processing (olap) tools
 
Olap, oltp and data mining
Olap, oltp and data miningOlap, oltp and data mining
Olap, oltp and data mining
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
OLAP v/s OLTP
OLAP v/s OLTPOLAP v/s OLTP
OLAP v/s OLTP
 
Odam: Open Data, Access and Mining
Odam: Open Data, Access and MiningOdam: Open Data, Access and Mining
Odam: Open Data, Access and Mining
 
A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database
A Common Database Approach for OLTP and OLAP Using an In-Memory Column DatabaseA Common Database Approach for OLTP and OLAP Using an In-Memory Column Database
A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database
 
02. Data Warehouse and OLAP
02. Data Warehouse and OLAP02. Data Warehouse and OLAP
02. Data Warehouse and OLAP
 
Olap and metadata
Olap and metadata Olap and metadata
Olap and metadata
 
Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-Programmers
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
Big data
Big dataBig data
Big data
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Why Talend for Big Data?
Why Talend for Big Data?Why Talend for Big Data?
Why Talend for Big Data?
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
 

Similar to Lecture 16

Intro to Data warehousing lecture 09
Intro to Data warehousing   lecture 09Intro to Data warehousing   lecture 09
Intro to Data warehousing lecture 09
AnwarrChaudary
 
project_phrase I.pptx
project_phrase I.pptxproject_phrase I.pptx
project_phrase I.pptx
Nambiraju
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
JesusaEspeleta
 
Information Flow Mechanism in Data warehouse
Information Flow Mechanism in Data warehouseInformation Flow Mechanism in Data warehouse
Information Flow Mechanism in Data warehouse
GunjanShree1
 
ETL_Methodology.pptx
ETL_Methodology.pptxETL_Methodology.pptx
ETL_Methodology.pptx
yogeshsuryawanshi47
 
ETL Process
ETL ProcessETL Process
ETL Process
Rohin Rangnekar
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
DataWorks Summit/Hadoop Summit
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
Splunk
 
ETL Process
ETL ProcessETL Process
ETL Process
Rashmi Bhat
 
Cloud data loading
Cloud data loadingCloud data loading
Cloud data loading
Feras Ahmad
 
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)
Andreas Buckenhofer
 
ETL DW-RealTime
ETL DW-RealTimeETL DW-RealTime
ETL DW-RealTime
Adriano Patrick Cunha
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
Splunk
 
The Database Environment Chapter 11
The Database Environment Chapter 11The Database Environment Chapter 11
The Database Environment Chapter 11
Jeanie Arnoco
 
GROPSIKS.pptx
GROPSIKS.pptxGROPSIKS.pptx
GROPSIKS.pptx
avanceregine312
 
Datawarehousing & DSS
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSS
Deepali Raut
 
DATA Warehousing & Data Mining
DATA Warehousing & Data MiningDATA Warehousing & Data Mining
DATA Warehousing & Data Mining
cpjcollege
 
moving data between the data bases in database
moving data between the data bases in databasemoving data between the data bases in database
moving data between the data bases in database
mqasimsheikh5
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystem
DataWorks Summit/Hadoop Summit
 

Similar to Lecture 16 (20)

Intro to Data warehousing lecture 09
Intro to Data warehousing   lecture 09Intro to Data warehousing   lecture 09
Intro to Data warehousing lecture 09
 
project_phrase I.pptx
project_phrase I.pptxproject_phrase I.pptx
project_phrase I.pptx
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
 
Information Flow Mechanism in Data warehouse
Information Flow Mechanism in Data warehouseInformation Flow Mechanism in Data warehouse
Information Flow Mechanism in Data warehouse
 
ETL_Methodology.pptx
ETL_Methodology.pptxETL_Methodology.pptx
ETL_Methodology.pptx
 
ETL Process
ETL ProcessETL Process
ETL Process
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
ETL Process
ETL ProcessETL Process
ETL Process
 
Cloud data loading
Cloud data loadingCloud data loading
Cloud data loading
 
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)
 
ETL DW-RealTime
ETL DW-RealTimeETL DW-RealTime
ETL DW-RealTime
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 
The Database Environment Chapter 11
The Database Environment Chapter 11The Database Environment Chapter 11
The Database Environment Chapter 11
 
GROPSIKS.pptx
GROPSIKS.pptxGROPSIKS.pptx
GROPSIKS.pptx
 
Datawarehousing & DSS
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSS
 
DATA Warehousing & Data Mining
DATA Warehousing & Data MiningDATA Warehousing & Data Mining
DATA Warehousing & Data Mining
 
moving data between the data bases in database
moving data between the data bases in databasemoving data between the data bases in database
moving data between the data bases in database
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystem
 

More from Shani729

Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
Shani729
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
Shani729
 
Interaction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interactionInteraction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interaction
Shani729
 
Fm lecturer 13(final)
Fm lecturer 13(final)Fm lecturer 13(final)
Fm lecturer 13(final)
Shani729
 
Lecture slides week14-15
Lecture slides week14-15Lecture slides week14-15
Lecture slides week14-15
Shani729
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
Shani729
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
Shani729
 
Dwh lecture slides-week10
Dwh lecture slides-week10Dwh lecture slides-week10
Dwh lecture slides-week10
Shani729
 
Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8
Shani729
 
Dwh lecture slides-week5&6
Dwh lecture slides-week5&6Dwh lecture slides-week5&6
Dwh lecture slides-week5&6
Shani729
 
Dwh lecture slides-week3&4
Dwh lecture slides-week3&4Dwh lecture slides-week3&4
Dwh lecture slides-week3&4
Shani729
 
Dwh lecture slides-week2
Dwh lecture slides-week2Dwh lecture slides-week2
Dwh lecture slides-week2
Shani729
 
Dwh lecture slides-week1
Dwh lecture slides-week1Dwh lecture slides-week1
Dwh lecture slides-week1
Shani729
 
Dwh lecture slides-week 13
Dwh lecture slides-week 13Dwh lecture slides-week 13
Dwh lecture slides-week 13
Shani729
 
Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13
Shani729
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furc
Shani729
 
Lecture 40
Lecture 40Lecture 40
Lecture 40
Shani729
 
Lecture 39
Lecture 39Lecture 39
Lecture 39
Shani729
 
Lecture 38
Lecture 38Lecture 38
Lecture 38
Shani729
 
Lecture 37
Lecture 37Lecture 37
Lecture 37
Shani729
 

More from Shani729 (20)

Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Interaction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interactionInteraction design _beyond_human_computer_interaction
Interaction design _beyond_human_computer_interaction
 
Fm lecturer 13(final)
Fm lecturer 13(final)Fm lecturer 13(final)
Fm lecturer 13(final)
 
Lecture slides week14-15
Lecture slides week14-15Lecture slides week14-15
Lecture slides week14-15
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
 
Dwh lecture slides-week10
Dwh lecture slides-week10Dwh lecture slides-week10
Dwh lecture slides-week10
 
Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8Dwh lecture slidesweek7&8
Dwh lecture slidesweek7&8
 
Dwh lecture slides-week5&6
Dwh lecture slides-week5&6Dwh lecture slides-week5&6
Dwh lecture slides-week5&6
 
Dwh lecture slides-week3&4
Dwh lecture slides-week3&4Dwh lecture slides-week3&4
Dwh lecture slides-week3&4
 
Dwh lecture slides-week2
Dwh lecture slides-week2Dwh lecture slides-week2
Dwh lecture slides-week2
 
Dwh lecture slides-week1
Dwh lecture slides-week1Dwh lecture slides-week1
Dwh lecture slides-week1
 
Dwh lecture slides-week 13
Dwh lecture slides-week 13Dwh lecture slides-week 13
Dwh lecture slides-week 13
 
Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13Dwh lecture slides-week 12&13
Dwh lecture slides-week 12&13
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furc
 
Lecture 40
Lecture 40Lecture 40
Lecture 40
 
Lecture 39
Lecture 39Lecture 39
Lecture 39
 
Lecture 38
Lecture 38Lecture 38
Lecture 38
 
Lecture 37
Lecture 37Lecture 37
Lecture 37
 

Recently uploaded

J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 

Recently uploaded (20)

J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
The role of big data in decision making.

Lecture 16

  • 1. Data Warehousing, Lecture-16: Extract Transform Load (ETL). Virtual University of Pakistan. Ahsan Abdullah, Assoc. Prof. & Head, Center for Agro-Informatics Research (www.nu.edu.pk/cairindex.asp), National University of Computers & Emerging Sciences, Islamabad. Email: ahsan1010@yahoo.com
  • 2. Extract Transform Load (ETL)
  • 3. Putting the pieces together. [Figure: tiered data warehouse architecture. Data sources (Tier 0): operational databases, semistructured sources, archived data, www data. Data Warehouse Server (Tier 1): data warehouse, data marts, meta data. OLAP Servers (Tier 2): MOLAP, ROLAP. Clients (Tier 3): query/reporting, analysis and data mining tools, used by IT users and business users. The Extract, Transform, Load (ETL) stage is highlighted; everything else is shown washed out.]
  • 4. The ETL Cycle. EXTRACT: the process of reading data from different sources. TRANSFORM: the process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database. LOAD: the process of writing the data into the target database. [Figure: MIS systems (Acct, HR), legacy systems, other indigenous applications (COBOL, VB, C++, Java), archived data and www data feed EXTRACT into temporary data storage, followed by TRANSFORM/CLEANSE and LOAD into the data warehouse, which serves OLAP.]
  • 5. ETL Processing. ETL consists of independent yet interrelated steps, so it is important to look at the big picture. Data acquisition time may include: extracts from source systems, data movement, data cleansing, data transformation, data loading, index maintenance, statistics collection and backup. Backup is a major task: it's a DWH, not a cube.
  • 6. Overview of Data Extraction. Extraction is the first step of ETL and is followed by many others. The source systems for extraction are typically OLTP systems. It is a very complex task for a number of reasons: the source systems are very complex and poorly documented, and data has to be extracted not once but a number of times. The process design depends on two questions: which extraction method to choose, and how to make the extracted data available for further processing.
  • 7. Types of Data Extraction. Logical Extraction: Full Extraction, Incremental Extraction. Physical Extraction: Online Extraction, Offline Extraction, Legacy vs. OLTP.
  • 8. Logical Data Extraction. Full Extraction: the data is extracted completely from the source system; there is no need to keep track of changes; the source data is made available as-is, without any additional information. Incremental Extraction: data is extracted after a well-defined point or event in time; a mechanism (a column or a table) is used to reflect/record the temporal changes in the data; sometimes entire tables are off-loaded from the source system into the DWH, which can have significant performance impacts on the data warehouse server.
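The difference between full and incremental extraction can be sketched in Python as follows. This is a minimal illustration, not the lecture's own method: the `customers` table, its `updated_at` change-tracking column, and the function name `extract_incremental` are all hypothetical, and an in-memory SQLite database stands in for the OLTP source.

```python
import sqlite3

def extract_incremental(conn, last_extracted_at):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_extracted_at,),
    ).fetchall()
    # Advance the watermark only if something was extracted.
    new_watermark = rows[-1][2] if rows else last_extracted_at
    return rows, new_watermark

# Hypothetical source table with a timestamp column that records changes.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ibrahim", "2024-01-01"),
     (2, "Abdullah", "2024-01-05"),
     (3, "Ahsan", "2024-01-09")],
)

# Full extraction: everything, no change tracking needed.
full_rows = source.execute("SELECT id, name, updated_at FROM customers").fetchall()

# Incremental extraction: only rows changed after the last extraction point.
changed_rows, watermark = extract_incremental(source, "2024-01-03")
```

The watermark returned by one run is what the next run would pass in, which is why the source needs a column or table that records when each row changed.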
  • 9. Physical Data Extraction… Online Extraction: data is extracted directly from the source system; it may access the source tables through an intermediate system, which is usually similar to the source system. Offline Extraction: data is NOT extracted directly from the source system, but is instead staged explicitly outside the original source system; the data is either already structured or was created by an extraction routine. Some of the prevalent structures are: flat files, dump files, redo and archive logs, transportable table-spaces.
  • 10. Physical Data Extraction. Legacy vs. OLTP: data is moved from the source system; a copy is made of the source system data; a staging area is used for performance reasons.
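As a rough sketch of offline extraction, the snippet below stages a copy of source data in a flat file outside the source system and then reads it back from the staging area. The `orders` table, the helper names, and the use of CSV in a temporary directory are assumptions made for illustration only.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def stage_to_flat_file(conn, query, path):
    """Copy the query result into a CSV flat file in the staging area."""
    cur = conn.execute(query)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
    return path

def read_staged(path):
    """Later ETL steps read the staged copy, not the source system."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Hypothetical source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, item TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "pen"), (2, "ink")])

staging_dir = Path(tempfile.mkdtemp())
staged_path = stage_to_flat_file(
    source, "SELECT id, item FROM orders ORDER BY id", staging_dir / "orders.csv"
)
staged = read_staged(staged_path)
```

Once the file is written, the source connection is no longer needed, so downstream processing places no further load on the source system.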
  • 11. Data Transformation. Basic tasks: 1. Selection 2. Splitting/Joining 3. Conversion 4. Summarization 5. Enrichment
  • 12. Data Transformation Basic Tasks: Selection
  • 13. Data Transformation Basic Tasks: Splitting/Joining
  • 14. Data Transformation Basic Tasks: Conversion
  • 15. Data Transformation Basic Tasks: Conversion, Example-1. Convert common data elements, e.g. name and address, into a consistent form. Translate dissimilar codes into a standard code. Field format and field data: First-Family-title: Muhammad Ibrahim Contractor; Family-title-comma-first: Ibrahim Contractor, Muhammad; Family-comma-first-title: Ibrahim, Muhammad Contractor. Codes: Natl. ID → NID; National ID → NID. Flat-number variants: F/NO-2, F-2, FL.NO.2, FL.2, FL/NO.2, FL-2, FLAT-2, FLAT#, FLAT,2, FLAT-NO-2, FL-NO.2, FLAT No. 2.
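The name and code conversions in this example could be sketched as below. The function names and the output dictionary layout are hypothetical, and the sketch assumes that each name part is a single word and that the field format of every source is known in advance.

```python
def normalize_name(value, field_format):
    """Convert a name in one of the three slide formats into one consistent form."""
    if field_format == "First-Family-title":
        first, family, title = value.split()
    elif field_format == "Family-title-comma-first":
        left, first = [part.strip() for part in value.split(",")]
        family, title = left.split()
    elif field_format == "Family-comma-first-title":
        family, right = [part.strip() for part in value.split(",")]
        first, title = right.split()
    else:
        raise ValueError(f"unknown field format: {field_format}")
    return {"first": first, "family": family, "title": title}

# Dissimilar codes translated into a standard code (pairs taken from the slide).
CODE_MAP = {"Natl. ID": "NID", "National ID": "NID"}

def standardize_code(code):
    """Return the standard code, or the input unchanged if no mapping exists."""
    code = code.strip()
    return CODE_MAP.get(code, code)

a = normalize_name("Muhammad Ibrahim Contractor", "First-Family-title")
b = normalize_name("Ibrahim Contractor, Muhammad", "Family-title-comma-first")
c = normalize_name("Ibrahim, Muhammad Contractor", "Family-comma-first-title")
```

All three inputs end up as the same record, which is the point of the conversion step: downstream joins and de-duplication see one representation.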
  • 16. Data Transformation Basic Tasks: Conversion, Example-2. Data representation change: EBCDIC to ASCII. Operating system change: mainframe (MVS) to UNIX; UNIX to NT or XP. Data type change: program (Excel to Access), database format (FoxPro to Access); character, numeric and date types; fixed and variable length.
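A small Python sketch of two of these conversions follows: a representation change from EBCDIC to ASCII, and a fixed-length record converted into typed, variable-length fields. It assumes the mainframe data uses code page 037 (Python's `cp037` codec); other EBCDIC variants exist, and the 10-character name plus 5-digit amount layout is invented for illustration.

```python
def ebcdic_to_ascii(raw, codec="cp037"):
    """Decode EBCDIC bytes and re-encode them as ASCII bytes."""
    return raw.decode(codec).encode("ascii", errors="replace")

def parse_fixed_record(raw, codec="cp037"):
    """Split a fixed-length record into a trimmed name and a numeric amount."""
    text = raw.decode(codec)
    name = text[:10].rstrip()   # fixed-length character field, padding removed
    amount = int(text[10:15])   # zero-padded character digits to a numeric type
    return name, amount

# 0xC8 0xC5 0xD3 0xD3 0xD6 spells "HELLO" in EBCDIC code page 037.
hello = ebcdic_to_ascii(bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6]))

# Hypothetical mainframe record: 10-character name followed by a 5-digit amount.
record = "IBRAHIM   00042".encode("cp037")
name, amount = parse_fixed_record(record)
```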
  • 17. Data Transformation Basic Tasks: Summarization
  • 18. Data Transformation Basic Tasks: Enrichment
  • 19. Data Transformation Basic Tasks: Enrichment, Example. Data elements are mapped from source tables and files to destination fact and dimension tables. Default values are used in the absence of source data. Fields are added for unique keys and time elements. Input data: HAJI MUHAMMAD IBRAHIM, GOVT. CONT. K. S. ABDULLAH & BROTHERS, MAMOOJI ROAD, ABDULLAH MANZIL RAWALPINDI, Ph 67855. Parsed data: First Name: HAJI MUHAMMAD; Family Name: IBRAHIM; Title: GOVT. CONT.; Firm: K. S. ABDULLAH & BROTHERS; Firm Location: ABDULLAH MANZIL; Road: MAMOOJI ROAD; Phone: 051-67855; City: RAWALPINDI; Code: 46200.
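The enrichment step in this example might look roughly like the sketch below. The dialing code 051 and the code 46200 for Rawalpindi come from the slide; everything else (the `enrich` function, the field names, the default values, the reference lookup, and the surrogate key counter) is a hypothetical stand-in for whatever the real ETL tool provides.

```python
from datetime import date
from itertools import count

# Reference data from the slide's example; a real lookup table is assumed.
CITY_REFERENCE = {"RAWALPINDI": {"dial_code": "051", "postal_code": "46200"}}

# Default values used in the absence of source data (illustrative choices).
DEFAULTS = {"title": "UNKNOWN", "firm": "UNKNOWN"}

def enrich(parsed, key_source, load_date):
    """Apply defaults, add reference data, a unique key and a time element."""
    row = dict(DEFAULTS)
    row.update({k: v for k, v in parsed.items() if v})  # skip empty source fields
    ref = CITY_REFERENCE.get(row.get("city"), {})
    if "dial_code" in ref and "phone" in row:
        row["phone"] = f"{ref['dial_code']}-{row['phone']}"
    row["code"] = ref.get("postal_code", "UNKNOWN")
    row["customer_key"] = next(key_source)      # unique (surrogate) key
    row["load_date"] = load_date.isoformat()    # time element
    return row

keys = count(1)
parsed = {
    "first_name": "HAJI MUHAMMAD",
    "family_name": "IBRAHIM",
    "title": "GOVT. CONT.",
    "firm": "",  # missing in this hypothetical source row, so the default applies
    "city": "RAWALPINDI",
    "phone": "67855",
}
enriched = enrich(parsed, keys, date(2024, 1, 1))
```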
  • 20. Aspects of Data Loading Strategies. Need to look at: data freshness, system performance, data volatility. Data freshness: very fresh data means low update efficiency; historical data means high update efficiency; there are always trade-offs in the light of goals. System performance: availability of staging table space; impact on query workload. Data volatility: ratio of new to historical data; high percentages of data change (batch update).
  • 21. Three Loading Strategies. Once we have transformed data, there are three primary loading strategies. Full data refresh: BLOCK INSERT or ‘block slamming’ into an empty table. Incremental data refresh: BLOCK INSERT or ‘block slamming’ into existing (populated) tables. Trickle/continuous feed: constant data collection and loading using row-level insert and update operations.
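The three strategies can be contrasted with a small SQLite sketch. This is only an illustration under stated assumptions: the `sales` table and function names are hypothetical, and `executemany` stands in for a warehouse's bulk or block insert facility, which it does not actually reproduce.

```python
import sqlite3

def full_refresh(conn, rows):
    """Empty the table, then bulk-insert the complete data set."""
    with conn:
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)

def incremental_refresh(conn, new_rows):
    """Bulk-insert only the new rows into the already populated table."""
    with conn:
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", new_rows)

def trickle_feed(conn, row):
    """Apply one row as it arrives: update if present, insert otherwise."""
    row_id, amount = row
    with conn:
        cur = conn.execute(
            "UPDATE sales SET amount = ? WHERE id = ?", (amount, row_id)
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO sales (id, amount) VALUES (?, ?)", (row_id, amount)
            )

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

full_refresh(dwh, [(1, 10.0), (2, 20.0)])
incremental_refresh(dwh, [(3, 30.0)])
trickle_feed(dwh, (2, 25.0))   # existing row: row-level update
trickle_feed(dwh, (4, 40.0))   # new row: row-level insert
final_rows = dwh.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
```

The first two functions touch the table once per batch, while the trickle feed pays a per-row cost in exchange for fresher data, which is the freshness versus update efficiency trade-off noted on the previous slide.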
