Abdul Khaliq
ETL
Need To Know:
• What is a data warehouse?
• Why are they necessary?
• How are they constructed?
What is a Data Warehouse?
• A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
– Time-variant: can study trends and changes
– Non-updatable: read-only, periodically refreshed
Why Are They Necessary?
• A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them available for decision support.
• A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.
• Operational systems (OLTP) and informational systems (OLAP) need the reconciliation of data!
Data Reconciliation
Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support.
How Are They Constructed?
The ETL Process
• Extract
• Transform
• Load
Data is:
• extracted from an OLTP (relational) database
• transformed to match the data warehouse schema
• loaded into the data warehouse
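A minimal sketch of these three steps in Python, using SQLite to stand in for both the OLTP source and the warehouse; the table and column names (orders, fact_orders, etc.) are illustrative assumptions, not part of the original slides.

```python
import sqlite3

def extract(source: sqlite3.Connection):
    # Extract: pull the required rows from the OLTP source.
    return source.execute(
        "SELECT order_id, customer, amount, order_date FROM orders"
    ).fetchall()

def transform(rows):
    # Transform: reshape rows to match the warehouse schema
    # (upper-case customer names, amounts stored in cents).
    return [(oid, cust.upper(), int(round(amount * 100)), date)
            for oid, cust, amount, date in rows]

def load(dw: sqlite3.Connection, rows):
    # Load: insert the transformed rows into the warehouse fact table.
    with dw:
        dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)")
    source.execute("INSERT INTO orders VALUES (1, 'acme', 19.99, '2024-01-05')")
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER, order_date TEXT)")
    load(dw, transform(extract(source)))
    print(dw.execute("SELECT * FROM fact_orders").fetchall())
```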
Purpose of the ETL Process
Typical operational data is:
– Transient – not historical
– Not normalized (perhaps due to de-normalization for performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
After ETL, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalized – third normal form or higher
– Comprehensive – enterprise-wide perspective
– Timely – current enough to assist decision-making
– Quality controlled – accurate with full integrity
Extraction
The Extract step covers data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible.
Logically, data can be extracted in two ways before physical data extraction.
Logical Extraction
Full Extraction: Full extraction is used when the data needs to be extracted and loaded for the first time. The data from the source is extracted completely, so the extraction reflects the current data available in the source system.
Incremental Extraction: In incremental extraction, changes in the source data are tracked since the last successful extraction, and only those changes are extracted and loaded. Changes can be detected from source data that carries a last-changed timestamp, or a change table can be created in the source system to keep track of changes to the source data.
One more way to get the incremental changes is to extract the complete source data and then take the difference (a minus operation) between the current extraction and the last extraction. This approach can cause performance problems.
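A sketch of both logical styles in Python, assuming a hypothetical customers source table with a last_modified timestamp column; the full extraction simply takes everything, while the incremental extraction takes only rows changed since the last successful run.

```python
import sqlite3

def full_extract(source: sqlite3.Connection):
    # Full extraction: everything currently in the source table.
    return source.execute(
        "SELECT customer_id, name, last_modified FROM customers"
    ).fetchall()

def incremental_extract(source: sqlite3.Connection, last_run: str):
    # Incremental extraction: only rows whose last-changed timestamp
    # is newer than the last successful extraction.
    return source.execute(
        "SELECT customer_id, name, last_modified FROM customers WHERE last_modified > ?",
        (last_run,),
    ).fetchall()
```

The "minus" approach mentioned above would instead compare a complete current extract against the previous complete extract, which is why it tends to be expensive.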
Physical Extraction
The data can be extracted physically by two methods:
Online Extraction: In online extraction, the data is extracted directly from the source system. The extraction process connects to the source system and extracts the source data. Because the data is pulled straight from the source for processing in the staging area, it is called online extraction. During extraction we connect directly to the source system and access the source tables; there is no need for any external staging area.
Offline Extraction: The data from the source system is dumped outside of the source system into a flat file, and this flat file is used to extract the data. The flat file can be created by a routine process, for example daily. Here the data is not extracted directly from the source; instead it is taken from an external area that keeps a copy of the source. The external area can be flat files or dump files in a specific format. When we need to process the data, we fetch the records from this external source instead of the actual source.
What data should be extracted?
• The selection and analysis of the source system is usually broken into
two major phases:
– The data discovery phase
– The anomaly detection phase
Extraction - Data Discovery Phase
• A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it
• Once you understand what the target needs to look like, you need to identify and examine the data sources
Data Discovery Phase
• It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse
– Collecting and documenting source systems
– Keeping track of source systems
– Determining the system of record – the point of origin of the data
– Defining the system of record is important because in most enterprises data is stored redundantly across many different systems
– Enterprises do this to make nonintegrated systems share data. It is very common that the same piece of data is copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data
Data Content Analysis - Extraction
• Understanding the content of the data is crucial for determining the best approach for retrieval
- NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns: joining two or more tables on a column that contains NULL values will cause data loss, because in a relational database NULL is not equal to NULL, so those rows fail the join. Check for NULL values in every foreign key in the source database; when NULL values are present, you must outer join the tables.
- Dates in non-date fields. Dates are peculiar elements because they are the only logical elements that can come in various formats, literally containing different values while having the exact same meaning. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format.
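A small sketch of the NULL foreign-key problem using SQLite; the customers and orders tables here are hypothetical. The inner join silently drops the order whose customer_id is NULL, while the outer join keeps it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (100, 1, 50.0), (101, NULL, 75.0);  -- NULL foreign key
""")

# Inner join: the row with a NULL customer_id never matches, so it is lost.
inner = con.execute(
    "SELECT o.order_id FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
).fetchall()

# Left outer join: the row is kept, with NULLs for the customer columns.
outer = con.execute(
    "SELECT o.order_id FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id"
).fetchall()

print(len(inner), len(outer))  # 1 2
```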
Data Transformation
• Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse.
• Data transformation consists of a variety of different functions:
– record-level functions,
– field-level functions and
– more complex transformations.
Record-level functions & Field-level functions
• Record-level functions
– Selection: data partitioning
– Joining: data combining
– Normalization
– Aggregation: data summarization, Aggregates
• Field-level functions
– Single-field transformation: from one field to one field
– Multi-field transformation: from many fields to one, or one field to many
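As an illustration of the field-level functions above, here is a small Python sketch: a single-field transformation that decodes a source gender code, and a multi-field transformation that splits one full-name field into two fields. All names and mappings are assumptions made for the example.

```python
# Single-field transformation: one source field maps to one target field,
# e.g. decoding a source system's gender code into a standard value.
GENDER_CODES = {"M": "Male", "F": "Female"}

def decode_gender(code):
    return GENDER_CODES.get(code, "Unknown")

# Multi-field transformation: one field to many (or many to one),
# e.g. splitting a single full-name field into first and last name.
def split_name(full_name: str):
    first, _, last = full_name.partition(" ")
    return first, last

print(decode_gender("F"))          # Female
print(split_name("Ada Lovelace"))  # ('Ada', 'Lovelace')
```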
Transformation
• Main step where the ETL adds value
• Actually changes data and provides guidance on whether data can be used for its intended purposes
• Performed in the staging area
Need for a Staging Area
Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is used during the ETL process to store intermediate results of processing.
It makes it possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step.
The staging area should be accessed by the ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-processing data.
Transformation - Data Quality Paradigm
• Correct
• Unambiguous/Clear
• Consistent
• Complete
• Data quality checks are run at two points: after extraction, and again after cleaning and confirming, where additional checks are run
Transformation - Cleaning Data
• Anomaly Detection
– Data sampling – count(*) of the rows for a department column
• Column Property Enforcement
– Null Values in columns
– Numeric values that fall outside of expected highs and lows
– Columns whose lengths are exceptionally short/long
– Columns with certain values outside of discrete valid value sets
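A sketch of such checks against a staged table in SQLite; the table and column names (staged_employees, department, salary, status) and the thresholds are assumptions chosen for illustration.

```python
import sqlite3

def column_property_checks(con: sqlite3.Connection):
    issues = []
    # Data sampling: row counts per department expose empty or suspiciously skewed groups.
    counts = con.execute(
        "SELECT department, COUNT(*) FROM staged_employees GROUP BY department"
    ).fetchall()
    # NULL values in a column that should always be populated.
    nulls = con.execute(
        "SELECT COUNT(*) FROM staged_employees WHERE department IS NULL"
    ).fetchone()[0]
    if nulls:
        issues.append(f"{nulls} rows with NULL department")
    # Numeric values outside expected highs and lows.
    out_of_range = con.execute(
        "SELECT COUNT(*) FROM staged_employees WHERE salary NOT BETWEEN 10000 AND 500000"
    ).fetchone()[0]
    if out_of_range:
        issues.append(f"{out_of_range} rows with out-of-range salary")
    # Values outside a discrete valid value set.
    bad_status = con.execute(
        "SELECT COUNT(*) FROM staged_employees WHERE status NOT IN ('ACTIVE', 'INACTIVE')"
    ).fetchone()[0]
    if bad_status:
        issues.append(f"{bad_status} rows with an unexpected status value")
    return counts, issues
```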
Cleansing
• The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules, such as:
• Making identifiers unique (e.g. the sex categories Male/Female/Unknown, M/F/null, and Man/Woman/Not Available are all translated to a standard Male/Female/Unknown)
• Converting null values into a standardized Not Available/Not Provided value
• Converting phone numbers and ZIP codes to a standardized form
• Validating address fields and converting them to proper naming, e.g. Street/St/St./Str./Str.
• Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street)
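A small Python sketch of a few such unification rules; the mappings and the digits-only phone format are illustrative assumptions, not rules prescribed by the slides.

```python
import re

SEX_MAP = {"M": "Male", "F": "Female", "MALE": "Male", "FEMALE": "Female",
           "MAN": "Male", "WOMAN": "Female"}

def unify_sex(value):
    # Translate the various source encodings into the standard Male/Female/Unknown.
    if value is None:
        return "Unknown"
    return SEX_MAP.get(value.strip().upper(), "Unknown")

def standardize_phone(raw: str) -> str:
    # Reduce a phone number to digits only as a simple standardized form.
    return re.sub(r"\D", "", raw)

def default_missing(value, placeholder="Not Available"):
    # Convert null/empty values into a standardized placeholder value.
    return value if value not in (None, "") else placeholder

print(unify_sex("f"), standardize_phone("(042) 555-1234"), default_missing(None))
```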
Cleansing
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Cleansing further feeds the ETL process
[Flow diagram: Staged Data → Cleaning and Confirming → Errors? Yes: stop. No: proceed to Loading.]
Transformation - Confirming
• Structure Enforcement
– Tables have proper primary and foreign keys
– Obey referential integrity
• Data and Rule value enforcement
– Simple business rules
– Logical data checks
Data Loading
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. To make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency.
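A sketch of that idea in Python/SQLite, where "disabling" amounts to dropping an index and turning off foreign-key enforcement for the duration of the bulk insert; the index and table names are assumptions, and the exact mechanism differs from one database product to another.

```python
import sqlite3

def bulk_load(dw: sqlite3.Connection, rows):
    # Defer constraint checking and drop the index so the bulk insert is cheap;
    # the ETL job is then responsible for referential integrity.
    dw.execute("PRAGMA foreign_keys = OFF")
    dw.execute("DROP INDEX IF EXISTS idx_fact_orders_date")
    with dw:  # one transaction for the whole batch
        dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    # Re-create the index and re-enable constraints after the load completes.
    dw.execute("CREATE INDEX idx_fact_orders_date ON fact_orders(order_date)")
    dw.execute("PRAGMA foreign_keys = ON")
```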
Loading can be:
Full Load – the entire data set is loaded, typically the very first time. Here the last extract date is left empty so that all the data gets loaded.
Incremental Load – only the delta, the difference between target and source data, is loaded at regular intervals. Here the last extract date is supplied so that only records after this date are loaded.
Why Incremental?
Speed. A full load on larger datasets takes a great amount of time and other server resources. Ideally all data loads are performed overnight, with the expectation of completing them before users see the data the next day. The overnight window may not be enough time for a full load to complete.
Preserving history. With an OLTP source that is not designed to keep history, a full load removes history from the destination as well, because a full load removes all existing records first. A full load therefore does not allow you to preserve history in the data warehouse.
Full Load vs. Incremental Load:
• Full load truncates all rows and loads from scratch; incremental load brings in only new and updated records.
• Full load requires more time; incremental load requires less time.
• A full load can easily be guaranteed; an incremental load is more difficult, since the ETL must check for new/updated rows.
• With a full load, history can be lost; with an incremental load it is retained.
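A sketch of an incremental load keyed on the last extract date, continuing the hypothetical orders/fact_orders tables from earlier; passing an empty last extract date degenerates into a full load, matching the description above.

```python
import sqlite3

def incremental_load(source: sqlite3.Connection, dw: sqlite3.Connection, last_extract_date: str):
    # Only records after the last extract date are pulled; an empty string
    # sorts before any real date, so it effectively triggers a full load.
    rows = source.execute(
        "SELECT order_id, customer, amount, order_date FROM orders WHERE order_date > ?",
        (last_extract_date or "",),
    ).fetchall()
    with dw:
        # INSERT OR REPLACE picks up updated rows as well as new ones,
        # assuming order_id is the primary key of fact_orders.
        dw.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```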
Loading Includes:
• Loading Dimensions
• Loading Facts
Dimensions
• Qualifying characteristics that provide additional perspectives to a
given fact
– DSS data is almost always viewed in relation to other data
• Dimensions are normally stored in dimension tables
Facts
• Numeric measurements (values) that represent a specific
business aspect or activity
• Stored in a fact table at the center of the star schema
• Contains facts that are linked through their dimensions
• Can be computed or derived at run time
• Updated periodically with data from operational databases
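A minimal star-schema sketch showing one dimension table and the fact table that references it; the tables and columns (dim_customer, fact_sales) are illustrative assumptions.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    -- Dimension table: qualifying characteristics that give each fact its perspectives.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name         TEXT,
        city         TEXT
    );

    -- Fact table at the center of the star schema: numeric measurements
    -- linked to their dimensions through foreign keys.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER,
        amount       REAL,
        quantity     INTEGER
    );
""")
```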
ETL
Thank You!