SlideShare a Scribd company logo
-
1
Data Warehousing
ETL Detail: Data Extraction & Transformation
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software Engineering
Capital University of Sciences & Technology, Islamabad Pakista
anwarchaudary@gmail.com
-
2
ETL Detail: Data Extraction &
Transformation
-
3
Extracting Changed Data
Incremental data extraction
Incremental data extraction i.e. what has changed, say during last 24
hrs if considering nightly extraction.
Efficient when changes can be identified
This is efficient, when the small changed data can be identified
efficiently.
Identification could be costly
Unfortunately, for many source systems, identifying the recently
modified data may be difficult or effect operation of the source
system.
Very challenging
Change Data Capture is therefore, typically the most challenging
technical issue in data extraction.
-
4
Source Systems
Two CDC sources
• Modern systems
• Legacy systems
-
5
CDC in Modern Systems
• Time Stamps
• Works if timestamp column present
• If column not present, add column
• May not be possible to modify table, so add triggers
• Triggers
• Create trigger for each source table
• Following each DML operation trigger performs updates
• Record DML operations in a log
• Partitioning
• Table range partitioned, say along date key
• Easy to identify new data, say last week’s data
-
6
CDC in Legacy Systems
 Changes recorded in tapes Changes occurred in legacy
transaction processing are recorded on the log or journal
tapes.
 Changes read and removed from tapes Log or journal tape are
read and the update/transaction changes are stripped off for
movement into the data warehouse.
 Problems with reading a log/journal tape are many:
 Contains lot of extraneous data
 Format is often arcane
 Often contains addresses instead of data values and keys
 Sequencing of data in the log tape often has deep and complex
implications
 Log tape varies widely from one DBMS to another.
-
7
Advantages
1. Immediate.
2. No loss of history
3. Flat files NOT required
CDC Advantages: Modern Systems
Modern
Systems
-
8
Advantages
1. No incremental on-line I/O required for log tape
2. The log tape captures all update processing
3. Log tape processing can be taken off-line.
4. No haste to make waste.
CDC Advantages: Legacy Systems
Legacy
Systems
-
9
Major Transformation Types
 Format revision
 Decoding of fields
 Calculated and derived values
 Splitting of single fields
 Merging of information
 Character set conversion
 Unit of measurement conversion
 Date/Time conversion
 Summarization
 Key restructuring
 Duplication
-
10
 Format revision
 Decoding of fields
 Calculated and derived values
 Splitting of single fields
Covered in De-Norm
Major Transformation Types
-
11
 Merging of information
 Character set conversion
 Unit of measurement conversion
 Date/Time conversion
Not really means combining columns to create one column.
Info for product coming from different sources merging it into single entity.
For PC architecture converting legacy EBCIDIC to ASCII
For companies with global branches Km vs. mile or lb vs Kg
November 14, 2005 as 11/14/2005 in US and 14/11/2005 in the British format.
This date may be standardized to be written as 14 NOV 2005.
Major Transformation Types
-
12
 Aggregation & Summarization
 How they are different?
Why both are required?
 Grain mismatch (don’t require, don’t have space)
 Data Marts requiring low detail
 Detail losing its utility
Adding
like values
Summarization with calculation across business dimension is
aggregation. Example Monthly compensation = monthly sale + bonus
Major Transformation Types
-
13
 Key restructuring (inherent meaning at source)
 i.e. 92424979234 changed to 12345678
 Removing duplication
92 42 4979 234
Country_Code City_Code Post_Code Product_Code
Incorrect or missing value
Inconsistent naming convention ONE vs 1
Incomplete information
Physically moved, but address not changed
Misspelling or falsification of names
Major Transformation Types
-
14
Data content defects
• Domain value redundancy
 Non-standard data formats
 Non-atomic data values
 Multipurpose data fields
 Embedded meanings
 Inconsistent data values
 Data quality contamination
-
15
Domain value redundancy
 Unit of Measure
 Dozen, Doz., Dz., 12
 Non-standard data formats
 Phone Numbers
 1234567 or 123.456.7
 Non-atomic data fields
 Name & Addresses
 Dr. Hameed Khan, PhD
Data content defects Examples
-
16
 Embedded Meanings
 RC, AP, RJ
 received, approved, rejected
Data content defects Examples
-
17
ETL Detail: Data Cleansing
-
18
Background
 Other names: Called as data scrubbing or cleaning.
 More than data arranging: DWH is NOT just about arranging data,
but should be clean for overall health of organization. We drink
clean water!
 Big problem, big effect: Enormous problem, as most data is dirty.
GIGO
 Dirty is relative: Dirty means does not confirm to proper domain
definition and vary from domain to domain.
 Paradox: Must involve domain expert, as detailed domain
knowledge is required, so it becomes semi-automatic, but has to be
automatic because of large data sets.
 Data duplication: Original problem was removing duplicates in one
system, compounded by duplicates from many systems.
-
19
Lighter Side of Dirty Data
 Year of birth 1995 current year 2005
 Born in 1986 hired in 1985
 Who would take it seriously? Computers while
summarizing, aggregating, populating etc.
 Small discrepancies become irrelevant for large
averages, but what about sums, medians, maximum,
minimum etc.?
-
20
Serious Side of dirty data
 Decision making at the Government level on
investment based on rate of birth in terms of
schools and then teachers. Wrong data resulting
in over and under investment.
 Direct mail marketing sending letters to wrong
addresses retuned, or multiple letters to same
address, loss of money and bad reputation and
wrong identification of marketing region.
-
21
3 Classes of Anomalies…
 Syntactically Dirty Data
 Lexical Errors
 Irregularities
 Semantically Dirty Data
 Integrity Constraint Violation
 Business rule contradiction
 Duplication
 Coverage Anomalies
 Missing Attributes
 Missing Records
-
22
3 Classes of Anomalies…
 Syntactically Dirty Data
 Lexical Errors
 Discrepancies between the structure of the data items and the specified
format of stored values
 e.g. number of columns used are unexpected for a tuple (mixed up number
of attributes)
 Irregularities
 Non uniform use of units and values, such as only giving annual salary but
without info i.e. in US$ or PK Rs?
 Semantically Dirty Data
 Integrity Constraint violation (Dept. 5)
 Contradiction
 DoB > Hiring date etc.
 Duplication
-
23
 Coverage or lack of it
 Missing Attribute
 Result of omissions while collecting the data.
 A constraint violation if we have null values for attributes where
NOT NULL constraint exists.
 Case more complicated where no such constraint exists.
 Have to decide whether the value exists in the real world and
has to be deduced here or not.
3 Classes of Anomalies…
-
24
Why Coverage Anomalies?
 Equipment malfunction (bar code reader, keyboard etc.)
 Inconsistent with other recorded data and thus deleted.
 Data not entered due to misunderstanding/illegibility.
 Data not considered important at the time of entry (e.g. Y2K).
-
25
 Dropping records.
 “Manually” filling missing values.
 Using a global constant as filler.
 Using the attribute mean (or median) as filler.
 Using the most probable value as filler.
Handling missing data
-
26
Key Based Classification of Problems
 Primary key problems
 Non-Primary key problems
-
27
Primary key problems
 Same PK but different data.
 Same entity with different keys.
 PK in one system but not in other.
 Same PK but in different formats.
-
28
Non primary key problems…
 Different encoding in different sources.
 Multiple ways to represent the same
information.
 Sources might contain invalid data.
 Two fields with different data but same
name.
-
29
 Required fields left blank.
 Data erroneous or incomplete.
 Data contains null values.
Non primary key problems
-
30
Automatic Data Cleansing…
1.Statistical
2.Pattern Based
3.Clustering
4.Association Rules
-
31
1. Statistical Methods
 Identifying outlier fields and records using the values of
mean, standard deviation, range, etc., based on
Chebyshev’s theorem
2. Pattern-based
 Identify outlier fields and records that do not conform to
existing patterns in the data.
 A pattern is defined by a group of records that have similar
characteristics (“behavior”) for p% of the fields in the data
set, where p is a user-defined value (usually above 90).
 Techniques such as partitioning, classification, and
clustering can be used to identify patterns that apply to
most records.
Automatic Data Cleansing…
-
32
3. Clustering
 Identify outlier records using clustering based on Euclidian (or
other) distance.
 Clustering the entire record space can reveal outliers that are
not identified at the field level inspection
 Main drawback of this method is computational time.
4. Association rules
 Association rules with high confidence and support define a
different kind of pattern.
 Records that do not follow these rules are considered outliers.
Automatic Data Cleansing

More Related Content

What's hot

Mc0077 – advanced database systems
Mc0077 – advanced database systemsMc0077 – advanced database systems
Mc0077 – advanced database systems
Rabby Bhatt
 
RES814 U1 Individual Project
RES814 U1 Individual ProjectRES814 U1 Individual Project
RES814 U1 Individual Project
ThienSi Le
 
Elements of Data Documentation
Elements of Data DocumentationElements of Data Documentation
Elements of Data Documentation
ssri-duke
 
RDMS AND SQL
RDMS AND SQLRDMS AND SQL
RDMS AND SQL
milanmehta7
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
Steven Johnson
 
Week 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data ModelWeek 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data Model
oudesign
 
Database intro
Database introDatabase intro
Database intro
varsha nihanth lade
 
Intro To DataBase
Intro To DataBaseIntro To DataBase
Intro To DataBase
DevMix
 
Data documentation and retrieval using unity in a universe®
Data documentation and retrieval using unity in a universe®Data documentation and retrieval using unity in a universe®
Data documentation and retrieval using unity in a universe®
ANIL247048
 
Unit I Database concepts - RDBMS & ORACLE
Unit I  Database concepts - RDBMS & ORACLEUnit I  Database concepts - RDBMS & ORACLE
Unit I Database concepts - RDBMS & ORACLE
DrkhanchanaR
 
Database Basics Theory
Database Basics TheoryDatabase Basics Theory
Database Basics Theory
sunmitraeducation
 
Z04506138145
Z04506138145Z04506138145
Z04506138145
IJERA Editor
 
Unit 08 dbms
Unit 08 dbmsUnit 08 dbms
Unit 08 dbms
anuragmbst
 
Database basics
Database basicsDatabase basics
Database basics
prachin514
 
Introduction to databases
Introduction to databasesIntroduction to databases
Introduction to databases
Bryan Corpuz
 
Database design
Database designDatabase design
Database design
Jennifer Polack
 
DISE - Database Concepts
DISE - Database ConceptsDISE - Database Concepts
DISE - Database Concepts
Rasan Samarasinghe
 
Relational databases
Relational databasesRelational databases
Relational databases
Fiddy Prasetiya
 
CS828 P5 Individual Project v101
CS828 P5 Individual Project v101CS828 P5 Individual Project v101
CS828 P5 Individual Project v101
ThienSi Le
 

What's hot (19)

Mc0077 – advanced database systems
Mc0077 – advanced database systemsMc0077 – advanced database systems
Mc0077 – advanced database systems
 
RES814 U1 Individual Project
RES814 U1 Individual ProjectRES814 U1 Individual Project
RES814 U1 Individual Project
 
Elements of Data Documentation
Elements of Data DocumentationElements of Data Documentation
Elements of Data Documentation
 
RDMS AND SQL
RDMS AND SQLRDMS AND SQL
RDMS AND SQL
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
 
Week 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data ModelWeek 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data Model
 
Database intro
Database introDatabase intro
Database intro
 
Intro To DataBase
Intro To DataBaseIntro To DataBase
Intro To DataBase
 
Data documentation and retrieval using unity in a universe®
Data documentation and retrieval using unity in a universe®Data documentation and retrieval using unity in a universe®
Data documentation and retrieval using unity in a universe®
 
Unit I Database concepts - RDBMS & ORACLE
Unit I  Database concepts - RDBMS & ORACLEUnit I  Database concepts - RDBMS & ORACLE
Unit I Database concepts - RDBMS & ORACLE
 
Database Basics Theory
Database Basics TheoryDatabase Basics Theory
Database Basics Theory
 
Z04506138145
Z04506138145Z04506138145
Z04506138145
 
Unit 08 dbms
Unit 08 dbmsUnit 08 dbms
Unit 08 dbms
 
Database basics
Database basicsDatabase basics
Database basics
 
Introduction to databases
Introduction to databasesIntroduction to databases
Introduction to databases
 
Database design
Database designDatabase design
Database design
 
DISE - Database Concepts
DISE - Database ConceptsDISE - Database Concepts
DISE - Database Concepts
 
Relational databases
Relational databasesRelational databases
Relational databases
 
CS828 P5 Individual Project v101
CS828 P5 Individual Project v101CS828 P5 Individual Project v101
CS828 P5 Individual Project v101
 

Similar to Intro to Data warehousing lecture 10

When & Why\'s of Denormalization
When & Why\'s of DenormalizationWhen & Why\'s of Denormalization
When & Why\'s of Denormalization
Aliya Saldanha
 
Etl - Extract Transform Load
Etl - Extract Transform LoadEtl - Extract Transform Load
Etl - Extract Transform Load
ABDUL KHALIQ
 
Basic and Introduction to DBMS Unit 1 of AU
Basic and Introduction to DBMS Unit 1 of AUBasic and Introduction to DBMS Unit 1 of AU
Basic and Introduction to DBMS Unit 1 of AU
infant2404
 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
Neo4j
 
It 302 computerized accounting (week 2) - sharifah
It 302   computerized accounting (week 2) - sharifahIt 302   computerized accounting (week 2) - sharifah
It 302 computerized accounting (week 2) - sharifah
alish sha
 
database1.pdf
database1.pdfdatabase1.pdf
database1.pdf
prashanna13
 
Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5
Ufuk Cebeci
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
JesusaEspeleta
 
Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )
Jenny Calhoon
 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing Terminology
Dev EngineersSaathi
 
Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1
Dios Kurniawan
 
7 - Enterprise IT in Action
7 - Enterprise IT in Action7 - Enterprise IT in Action
7 - Enterprise IT in Action
Raymond Gao
 
4 preprocess
4 preprocess4 preprocess
4 preprocess
anita desiani
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
Costa Pissaris
 
Introduction to Databases by Dr. Kamal Gulati
Introduction to Databases by Dr. Kamal GulatiIntroduction to Databases by Dr. Kamal Gulati
Data models
Data modelsData models
Data models
Usman Tariq
 
Lecture 18
Lecture 18Lecture 18
Lecture 18
Shani729
 
Chapter 2 Cond (1).ppt
Chapter 2 Cond (1).pptChapter 2 Cond (1).ppt
Chapter 2 Cond (1).ppt
kannaradhas
 
Dqs mds-matching 15042015
Dqs mds-matching 15042015Dqs mds-matching 15042015
Dqs mds-matching 15042015
Neil Hambly
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
Yugal Kumar
 

Similar to Intro to Data warehousing lecture 10 (20)

When & Why\'s of Denormalization
When & Why\'s of DenormalizationWhen & Why\'s of Denormalization
When & Why\'s of Denormalization
 
Etl - Extract Transform Load
Etl - Extract Transform LoadEtl - Extract Transform Load
Etl - Extract Transform Load
 
Basic and Introduction to DBMS Unit 1 of AU
Basic and Introduction to DBMS Unit 1 of AUBasic and Introduction to DBMS Unit 1 of AU
Basic and Introduction to DBMS Unit 1 of AU
 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
 
It 302 computerized accounting (week 2) - sharifah
It 302   computerized accounting (week 2) - sharifahIt 302   computerized accounting (week 2) - sharifah
It 302 computerized accounting (week 2) - sharifah
 
database1.pdf
database1.pdfdatabase1.pdf
database1.pdf
 
Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5Data processing in Industrial Systems course notes after week 5
Data processing in Industrial Systems course notes after week 5
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
 
Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )
 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing Terminology
 
Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1
 
7 - Enterprise IT in Action
7 - Enterprise IT in Action7 - Enterprise IT in Action
7 - Enterprise IT in Action
 
4 preprocess
4 preprocess4 preprocess
4 preprocess
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
 
Introduction to Databases by Dr. Kamal Gulati
Introduction to Databases by Dr. Kamal GulatiIntroduction to Databases by Dr. Kamal Gulati
Introduction to Databases by Dr. Kamal Gulati
 
Data models
Data modelsData models
Data models
 
Lecture 18
Lecture 18Lecture 18
Lecture 18
 
Chapter 2 Cond (1).ppt
Chapter 2 Cond (1).pptChapter 2 Cond (1).ppt
Chapter 2 Cond (1).ppt
 
Dqs mds-matching 15042015
Dqs mds-matching 15042015Dqs mds-matching 15042015
Dqs mds-matching 15042015
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 

More from AnwarrChaudary

Intro to Data warehousing lecture 20
Intro to Data warehousing   lecture 20Intro to Data warehousing   lecture 20
Intro to Data warehousing lecture 20
AnwarrChaudary
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19
AnwarrChaudary
 
Intro to Data warehousing lecture 18
Intro to Data warehousing   lecture 18Intro to Data warehousing   lecture 18
Intro to Data warehousing lecture 18
AnwarrChaudary
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
AnwarrChaudary
 
Intro to Data warehousing lecture 16
Intro to Data warehousing   lecture 16Intro to Data warehousing   lecture 16
Intro to Data warehousing lecture 16
AnwarrChaudary
 
Intro to Data warehousing lecture 15
Intro to Data warehousing   lecture 15Intro to Data warehousing   lecture 15
Intro to Data warehousing lecture 15
AnwarrChaudary
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14
AnwarrChaudary
 
Intro to Data warehousing lecture 13
Intro to Data warehousing   lecture 13Intro to Data warehousing   lecture 13
Intro to Data warehousing lecture 13
AnwarrChaudary
 
Intro to Data warehousing lecture 12
Intro to Data warehousing   lecture 12Intro to Data warehousing   lecture 12
Intro to Data warehousing lecture 12
AnwarrChaudary
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
AnwarrChaudary
 
Intro to Data warehousing lecture 09
Intro to Data warehousing   lecture 09Intro to Data warehousing   lecture 09
Intro to Data warehousing lecture 09
AnwarrChaudary
 
Intro to Data warehousing lecture 08
Intro to Data warehousing   lecture 08Intro to Data warehousing   lecture 08
Intro to Data warehousing lecture 08
AnwarrChaudary
 
Intro to Data warehousing lecture 07
Intro to Data warehousing   lecture 07Intro to Data warehousing   lecture 07
Intro to Data warehousing lecture 07
AnwarrChaudary
 
Intro to Data warehousing Lecture 06
Intro to Data warehousing   Lecture 06Intro to Data warehousing   Lecture 06
Intro to Data warehousing Lecture 06
AnwarrChaudary
 
Intro to Data warehousing lecture 05
Intro to Data warehousing   lecture 05Intro to Data warehousing   lecture 05
Intro to Data warehousing lecture 05
AnwarrChaudary
 
Intro to Data warehousing Lecture 04
Intro to Data warehousing   Lecture 04Intro to Data warehousing   Lecture 04
Intro to Data warehousing Lecture 04
AnwarrChaudary
 
Intro to Data warehousing lecture 03
Intro to Data warehousing   lecture 03Intro to Data warehousing   lecture 03
Intro to Data warehousing lecture 03
AnwarrChaudary
 
Intro to Data warehousing lecture 02
Intro to Data warehousing   lecture 02Intro to Data warehousing   lecture 02
Intro to Data warehousing lecture 02
AnwarrChaudary
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
AnwarrChaudary
 
Introduction to Software Engineering
Introduction to Software EngineeringIntroduction to Software Engineering
Introduction to Software Engineering
AnwarrChaudary
 

More from AnwarrChaudary (20)

Intro to Data warehousing lecture 20
Intro to Data warehousing   lecture 20Intro to Data warehousing   lecture 20
Intro to Data warehousing lecture 20
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19
 
Intro to Data warehousing lecture 18
Intro to Data warehousing   lecture 18Intro to Data warehousing   lecture 18
Intro to Data warehousing lecture 18
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
 
Intro to Data warehousing lecture 16
Intro to Data warehousing   lecture 16Intro to Data warehousing   lecture 16
Intro to Data warehousing lecture 16
 
Intro to Data warehousing lecture 15
Intro to Data warehousing   lecture 15Intro to Data warehousing   lecture 15
Intro to Data warehousing lecture 15
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14
 
Intro to Data warehousing lecture 13
Intro to Data warehousing   lecture 13Intro to Data warehousing   lecture 13
Intro to Data warehousing lecture 13
 
Intro to Data warehousing lecture 12
Intro to Data warehousing   lecture 12Intro to Data warehousing   lecture 12
Intro to Data warehousing lecture 12
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
 
Intro to Data warehousing lecture 09
Intro to Data warehousing   lecture 09Intro to Data warehousing   lecture 09
Intro to Data warehousing lecture 09
 
Intro to Data warehousing lecture 08
Intro to Data warehousing   lecture 08Intro to Data warehousing   lecture 08
Intro to Data warehousing lecture 08
 
Intro to Data warehousing lecture 07
Intro to Data warehousing   lecture 07Intro to Data warehousing   lecture 07
Intro to Data warehousing lecture 07
 
Intro to Data warehousing Lecture 06
Intro to Data warehousing   Lecture 06Intro to Data warehousing   Lecture 06
Intro to Data warehousing Lecture 06
 
Intro to Data warehousing lecture 05
Intro to Data warehousing   lecture 05Intro to Data warehousing   lecture 05
Intro to Data warehousing lecture 05
 
Intro to Data warehousing Lecture 04
Intro to Data warehousing   Lecture 04Intro to Data warehousing   Lecture 04
Intro to Data warehousing Lecture 04
 
Intro to Data warehousing lecture 03
Intro to Data warehousing   lecture 03Intro to Data warehousing   lecture 03
Intro to Data warehousing lecture 03
 
Intro to Data warehousing lecture 02
Intro to Data warehousing   lecture 02Intro to Data warehousing   lecture 02
Intro to Data warehousing lecture 02
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Introduction to Software Engineering
Introduction to Software EngineeringIntroduction to Software Engineering
Introduction to Software Engineering
 

Recently uploaded

Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
adhitya5119
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
simonomuemu
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Hindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdfHindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdf
Dr. Mulla Adam Ali
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
Nicholas Montgomery
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 

Recently uploaded (20)

Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Hindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdfHindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdf
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
 

Intro to Data warehousing lecture 10

  • 1. - 1 Data Warehousing ETL Detail: Data Extraction & Transformation Ch Anwar ul Hassan (Lecturer) Department of Computer Science and Software Engineering Capital University of Sciences & Technology, Islamabad Pakista anwarchaudary@gmail.com
  • 2. - 2 ETL Detail: Data Extraction & Transformation
  • 3. - 3 Extracting Changed Data Incremental data extraction Incremental data extraction i.e. what has changed, say during last 24 hrs if considering nightly extraction. Efficient when changes can be identified This is efficient, when the small changed data can be identified efficiently. Identification could be costly Unfortunately, for many source systems, identifying the recently modified data may be difficult or effect operation of the source system. Very challenging Change Data Capture is therefore, typically the most challenging technical issue in data extraction.
  • 4. - 4 Source Systems Two CDC sources • Modern systems • Legacy systems
  • 5. - 5 CDC in Modern Systems • Time Stamps • Works if timestamp column present • If column not present, add column • May not be possible to modify table, so add triggers • Triggers • Create trigger for each source table • Following each DML operation trigger performs updates • Record DML operations in a log • Partitioning • Table range partitioned, say along date key • Easy to identify new data, say last week’s data
  • 6. - 6 CDC in Legacy Systems  Changes recorded in tapes Changes occurred in legacy transaction processing are recorded on the log or journal tapes.  Changes read and removed from tapes Log or journal tape are read and the update/transaction changes are stripped off for movement into the data warehouse.  Problems with reading a log/journal tape are many:  Contains lot of extraneous data  Format is often arcane  Often contains addresses instead of data values and keys  Sequencing of data in the log tape often has deep and complex implications  Log tape varies widely from one DBMS to another.
  • 7. - 7 Advantages 1. Immediate. 2. No loss of history 3. Flat files NOT required CDC Advantages: Modern Systems Modern Systems
  • 8. - 8 Advantages 1. No incremental on-line I/O required for log tape 2. The log tape captures all update processing 3. Log tape processing can be taken off-line. 4. No haste to make waste. CDC Advantages: Legacy Systems Legacy Systems
  • 9. - 9 Major Transformation Types  Format revision  Decoding of fields  Calculated and derived values  Splitting of single fields  Merging of information  Character set conversion  Unit of measurement conversion  Date/Time conversion  Summarization  Key restructuring  Duplication
  • 10. - 10  Format revision  Decoding of fields  Calculated and derived values  Splitting of single fields Covered in De-Norm Major Transformation Types
  • 11. - 11  Merging of information  Character set conversion  Unit of measurement conversion  Date/Time conversion Not really means combining columns to create one column. Info for product coming from different sources merging it into single entity. For PC architecture converting legacy EBCIDIC to ASCII For companies with global branches Km vs. mile or lb vs Kg November 14, 2005 as 11/14/2005 in US and 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV 2005. Major Transformation Types
  • 12. - 12  Aggregation & Summarization  How they are different? Why both are required?  Grain mismatch (don’t require, don’t have space)  Data Marts requiring low detail  Detail losing its utility Adding like values Summarization with calculation across business dimension is aggregation. Example Monthly compensation = monthly sale + bonus Major Transformation Types
  • 13. - 13  Key restructuring (inherent meaning at source)  i.e. 92424979234 changed to 12345678  Removing duplication 92 42 4979 234 Country_Code City_Code Post_Code Product_Code Incorrect or missing value Inconsistent naming convention ONE vs 1 Incomplete information Physically moved, but address not changed Misspelling or falsification of names Major Transformation Types
  • 14. - 14 Data content defects • Domain value redundancy  Non-standard data formats  Non-atomic data values  Multipurpose data fields  Embedded meanings  Inconsistent data values  Data quality contamination
  • 15. - 15 Domain value redundancy  Unit of Measure  Dozen, Doz., Dz., 12  Non-standard data formats  Phone Numbers  1234567 or 123.456.7  Non-atomic data fields  Name & Addresses  Dr. Hameed Khan, PhD Data content defects Examples
  • 16. - 16  Embedded Meanings  RC, AP, RJ  received, approved, rejected Data content defects Examples
  • 18. - 18 Background  Other names: Called as data scrubbing or cleaning.  More than data arranging: DWH is NOT just about arranging data, but should be clean for overall health of organization. We drink clean water!  Big problem, big effect: Enormous problem, as most data is dirty. GIGO  Dirty is relative: Dirty means does not confirm to proper domain definition and vary from domain to domain.  Paradox: Must involve domain expert, as detailed domain knowledge is required, so it becomes semi-automatic, but has to be automatic because of large data sets.  Data duplication: Original problem was removing duplicates in one system, compounded by duplicates from many systems.
  • 19. - 19 Lighter Side of Dirty Data  Year of birth 1995 current year 2005  Born in 1986 hired in 1985  Who would take it seriously? Computers while summarizing, aggregating, populating etc.  Small discrepancies become irrelevant for large averages, but what about sums, medians, maximum, minimum etc.?
  • 20. - 20 Serious Side of dirty data  Decision making at the Government level on investment based on rate of birth in terms of schools and then teachers. Wrong data resulting in over and under investment.  Direct mail marketing sending letters to wrong addresses retuned, or multiple letters to same address, loss of money and bad reputation and wrong identification of marketing region.
  • 21. - 21 3 Classes of Anomalies…  Syntactically Dirty Data  Lexical Errors  Irregularities  Semantically Dirty Data  Integrity Constraint Violation  Business rule contradiction  Duplication  Coverage Anomalies  Missing Attributes  Missing Records
  • 22. - 22 3 Classes of Anomalies…  Syntactically Dirty Data  Lexical Errors  Discrepancies between the structure of the data items and the specified format of stored values  e.g. number of columns used are unexpected for a tuple (mixed up number of attributes)  Irregularities  Non uniform use of units and values, such as only giving annual salary but without info i.e. in US$ or PK Rs?  Semantically Dirty Data  Integrity Constraint violation (Dept. 5)  Contradiction  DoB > Hiring date etc.  Duplication
  • 23. - 23  Coverage or lack of it  Missing Attribute  Result of omissions while collecting the data.  A constraint violation if we have null values for attributes where NOT NULL constraint exists.  Case more complicated where no such constraint exists.  Have to decide whether the value exists in the real world and has to be deduced here or not. 3 Classes of Anomalies…
  • 24. - 24 Why Coverage Anomalies?  Equipment malfunction (bar code reader, keyboard etc.)  Inconsistent with other recorded data and thus deleted.  Data not entered due to misunderstanding/illegibility.  Data not considered important at the time of entry (e.g. Y2K).
  • 25. - 25  Dropping records.  “Manually” filling missing values.  Using a global constant as filler.  Using the attribute mean (or median) as filler.  Using the most probable value as filler. Handling missing data
  • 26. - 26 Key Based Classification of Problems  Primary key problems  Non-Primary key problems
  • 27. - 27 Primary key problems  Same PK but different data.  Same entity with different keys.  PK in one system but not in other.  Same PK but in different formats.
  • 28. - 28 Non primary key problems…  Different encoding in different sources.  Multiple ways to represent the same information.  Sources might contain invalid data.  Two fields with different data but same name.
  • 29. - 29  Required fields left blank.  Data erroneous or incomplete.  Data contains null values. Non primary key problems
  • 30. - 30 Automatic Data Cleansing… 1.Statistical 2.Pattern Based 3.Clustering 4.Association Rules
  • 31. - 31 1. Statistical Methods  Identifying outlier fields and records using the values of mean, standard deviation, range, etc., based on Chebyshev’s theorem 2. Pattern-based  Identify outlier fields and records that do not conform to existing patterns in the data.  A pattern is defined by a group of records that have similar characteristics (“behavior”) for p% of the fields in the data set, where p is a user-defined value (usually above 90).  Techniques such as partitioning, classification, and clustering can be used to identify patterns that apply to most records. Automatic Data Cleansing…
  • 32. - 32 3. Clustering  Identify outlier records using clustering based on Euclidian (or other) distance.  Clustering the entire record space can reveal outliers that are not identified at the field level inspection  Main drawback of this method is computational time. 4. Association rules  Association rules with high confidence and support define a different kind of pattern.  Records that do not follow these rules are considered outliers. Automatic Data Cleansing