Fair Use Notice
The material used in this presentation (pictures, graphs, text, etc.) is intended solely for educational/teaching purposes, offered free of cost to students for use under the special circumstances of online education due to the COVID-19 lockdown, and may include copyrighted material whose use may not have been specifically authorised by the copyright owners. Its use constitutes fair use of such copyrighted material as provided for in the laws of many countries. The contents of this presentation are intended only for the attendees of the class being conducted by the presenter.
ETL: Data Extraction, Transformation and Loading
Disclaimer: The contents of this presentation have been taken from multiple resources available on the internet, including books, notes, reports, websites and presentations.
ETL Overview
• The technical architecture of the data warehouse comprises three functional areas: data acquisition, data storage, and information delivery.
• Data extraction, transformation, and loading (the ETL process) encompasses the areas of data acquisition and data storage.
• These are back-end processes that cover the extraction of data from the source systems. They include all the functions and procedures for changing the source data into the exact formats and structures appropriate for storage in the data warehouse database.
• After the transformation of the data, these processes consist of all the functions for physically moving the data into the data warehouse repository.
• You will not extract all of the operational data and dump it into the data warehouse.
• Extraction is driven by user requirements: your requirements definition should guide you as to what data you need to extract and from which source systems.
Challenges in ETL
• Source systems are very diverse and disparate.
• There is usually a need to deal with source systems on multiple platforms and
different operating systems.
• Many source systems are older legacy applications running on obsolete
database technologies.
• Generally, historical data on changes in values are not preserved in source
operational systems. Historical information is critical in a data warehouse.
• Quality of data is dubious in many old source systems that have evolved over
time.
• Source system structures keep changing over time because of new business
conditions. ETL functions must also be modified accordingly.
• A gross lack of consistency among source systems is common. The same data is likely to be represented differently in the various source systems.
• For example, data on salary may be represented as monthly salary, weekly salary,
and bimonthly salary in different source payroll systems.
• Even when inconsistent data is detected among disparate source systems, lack of a
means for resolving mismatches escalates the problem of inconsistency.
• Most source systems do not represent data in types or formats that are meaningful
to the users. Many representations are cryptic and ambiguous.
(This is a prevalent problem in legacy systems, many of which were not designed with
end-users in mind. For example, the customer status codes could have been started
with R = Regular and N = New. Then at one time, another code D = Deceased could
have been added. Down the road, a further code A = Archive could have been
included.)
• Project teams spend as much as 50–70% of the project effort on ETL functions.
DATA EXTRACTION
• Data extraction for a new operational system is quite different from the data
extraction for a data warehouse.
• First, for a data warehouse, you have to extract data from many disparate sources.
• Next, for a data warehouse, you have to extract data on the changes for ongoing
incremental loads as well as for a one-time initial full load. For operational
systems, all you need is one-time extractions and data conversions.
• The complexity of data extraction for a data warehouse therefore warrants the use of third-party data extraction tools (generally more expensive) in addition to in-house programs or scripts.
• On the other hand, in-house programs increase the cost of maintenance and
are hard to maintain as source systems change.
• Effective data extraction is a key to the success of your data warehouse.
Therefore, you need to pay special attention to the issues and formulate a
data extraction strategy for your data warehouse.
Data Extraction issues:
1. Source Identification: Identify source applications and source structures.
2. Method of extraction: For each data source, define whether the extraction
process is manual or tool-based.
3. Extraction frequency: For each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
4. Time window: For each data source, denote the time window for the
extraction process.
5. Job sequencing: Determine whether the beginning of one job in an
extraction job stream has to wait until the previous job has finished
successfully.
6. Exception handling: Determine how to handle input records that cannot be
extracted.
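A data extraction strategy can be recorded per source. Below is a minimal Python sketch, with hypothetical source names and values, showing one configuration record per source that covers the six issues above:

```python
# A per-source extraction plan covering the six issues above.
# All source names, windows, and policies here are hypothetical.
extraction_plan = [
    {
        "source": "payroll_legacy",           # 1. source identification
        "method": "in-house extract program", # 2. method of extraction
        "frequency": "weekly",                # 3. extraction frequency
        "time_window": "Sat 01:00-04:00",     # 4. time window
        "depends_on": None,                   # 5. job sequencing
        "on_bad_record": "write to reject file, continue",  # 6. exceptions
    },
    {
        "source": "orders_oltp",
        "method": "third-party ETL tool",
        "frequency": "daily",
        "time_window": "02:00-05:00",
        "depends_on": "payroll_legacy",       # wait for the previous job
        "on_bad_record": "abort the job",
    },
]
```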
Data Extraction Techniques
• Source data is in a state of constant flux because business transactions keep changing the data in the source systems.
• Data in the source systems are said to be time-dependent or temporal. This is
because source data changes with time.
• For example: When a customer moves to another state, the data about that
customer changes in the customer table in the source system.
• When two additional package types are added to the way a product may be sold,
the product data changes in the source system.
• When a correction is applied to the quantity ordered, the data about that order
gets changed in the source system.
Now consider the following scenario:
• A customer moves from New York state to California.
• In the operational system, what matters is the customer's current address, which now has CA as the state code.
• The actual change transaction itself, stating that the previous state code was NY
and the revised state code is CA, need not be preserved.
• But how does this change affect the information in the data warehouse?
• If the state code is used for analyzing some measurements such as sales, the
sales to the customer prior to the change must be counted in New York state
and those after the move must be counted in California.
• In other words, the history cannot be ignored in the data warehouse.
• So how do you capture the history from the source systems?
Data in Operational Systems
• These source systems generally store data in two ways:
• Current Value. Most of the attributes in the source systems fall into this
category. Here the stored value of an attribute represents the value of the
attribute at this moment of time.
• Periodic Status. This category is not as common as the previous one. Here, the value of the attribute is preserved as a status every time a change occurs. At each of these points in time, the status value is stored with reference to the time when the new value became effective.
Extraction Techniques
• When you deploy your data warehouse, the initial data as of a certain time
must be moved to the data warehouse to get it started. This is the initial load.
• After the initial load, your data warehouse must be kept updated so that the history of changes and statuses is reflected in the data warehouse.
• Broadly, there are two major types of data extractions from the source
operational systems:
• “as is” (static) data and data of revisions.
• “As is” or static data is the capture of data at a given point in time. It is like
taking a snapshot of the relevant source data at a certain point in time.
• You will use static data capture primarily for the initial load of the data
warehouse.
Extraction Techniques
• Data of revisions is also known as incremental data capture.
• For periodic status data or periodic event data, the incremental data
capture includes the values of attributes at specific times.
• Extract the statuses and events that have been recorded since the last
date of extract.
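A minimal Python sketch of incremental data capture, assuming the source table carries a last-updated timestamp column (the sqlite3 source and all table and column names are illustrative; real sources may instead expose changes through transaction logs or replication). Static capture for the initial load is the same query without the WHERE clause:

```python
import sqlite3

# Hypothetical source table with a last_updated column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id, state_code, last_updated)")
conn.execute("INSERT INTO customers VALUES (101, 'CA', '2024-01-15')")
conn.execute("INSERT INTO customers VALUES (102, 'NY', '2023-11-02')")

def extract_revisions(conn, last_extract_date):
    # Pull only the rows recorded since the last date of extract.
    cur = conn.execute(
        "SELECT customer_id, state_code, last_updated "
        "FROM customers WHERE last_updated > ?",
        (last_extract_date,),
    )
    return cur.fetchall()

print(extract_revisions(conn, "2024-01-01"))  # [(101, 'CA', '2024-01-15')]
```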
DATA TRANSFORMATION
• The extracted data is raw data and cannot be applied to the data warehouse as is.
• All the extracted data must be made usable in the data warehouse.
• Having information that is usable for strategic decision making is the underlying
principle of the data warehouse.
• The relevant data extracted from the dissimilar source systems comes in a variety of source data formats and data values, and in varying conditions of data quality. (Data quality is crucial: it determines whether information can serve its purpose in a particular context such as data analysis. Quality traits: accuracy, completeness, reliability, relevance, and timeliness.)
• You have to perform several types of transformations to make the source data
suitable for your data warehouse.
• Transformation of source data encompasses a wide variety of manipulations to change all the extracted source data into usable information to be stored in the data warehouse.
Major Transformation Types
Format Revisions
• You will come across these quite often. These revisions include changes to the
data types and lengths of individual fields.
• In your source systems, product package types may be indicated by codes and
names in which the fields are numeric and text data types.
• Again, the lengths of the package types may vary among the different source
systems. It is wise to standardize and change the data type to text to provide
values meaningful to the users.
Decoding
• This is also a common type of data transformation. When you deal with
multiple source systems, you are bound to have the same data items
described by a plethora of field values.
• The classic example is the coding for gender, with one source system using 1
and 2 for male and female and another system using M and F.
• Also, many legacy systems use cryptic codes to represent business values.
What do the codes AC, IN, RE, and SU mean in a customer file? You need to
decode all such cryptic codes and change these into values that make sense to
the users.
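A minimal Python sketch of decoding; the two source systems and their code tables are hypothetical:

```python
# One mapping per source system into a single standard set of values.
GENDER_DECODE = {
    "payroll": {"1": "Male", "2": "Female"},
    "crm":     {"M": "Male", "F": "Female"},
}

def decode_gender(source, code):
    # Route unmapped codes to exception handling instead of loading them.
    return GENDER_DECODE.get(source, {}).get(code, "Unknown")

print(decode_gender("payroll", "1"))  # Male
print(decode_gender("crm", "F"))      # Female
print(decode_gender("crm", "X"))      # Unknown
```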
Calculated and Derived Values
• What if you want to keep profit margin along with sales and cost amounts in
your data warehouse tables?
• The extracted data from the sales system contains sales amounts, sales units,
and operating cost estimates by product. You will have to calculate the total
cost and the profit margin before data can be stored in the data warehouse.
• The total cost and the profit margin computed this way are examples of calculated values. You may also want to store the customer's age separately; that would be an example of a derived value.
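A minimal Python sketch of both cases; the field names are hypothetical:

```python
from datetime import date

def transform_sales(row):
    # Calculated values: total cost and profit margin from extracted amounts.
    total_cost = row["unit_cost_estimate"] * row["sales_units"]
    return {**row,
            "total_cost": total_cost,
            "profit_margin": row["sales_amount"] - total_cost}

def derive_age(date_of_birth, as_of):
    # Derived value: age computed from the stored date of birth.
    before_birthday = (as_of.month, as_of.day) < (date_of_birth.month,
                                                  date_of_birth.day)
    return as_of.year - date_of_birth.year - before_birthday

print(transform_sales({"sales_amount": 500.0, "sales_units": 10,
                       "unit_cost_estimate": 35.0}))
print(derive_age(date(1990, 6, 15), date(2024, 1, 1)))  # 33
```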
Splitting of Single Fields
• In legacy systems, large text fields are generally used to store data such as names (comprising first name, middle initials, and last name) and addresses (city, state, and Zip Code) of customers and employees.
• You mostly need to store individual components in separate fields in your
data warehouse for two good reasons:
• First, you may improve the operating performance by indexing on individual
components.
• Second, your users may need to perform analysis by using individual
components such as city, state, and Zip Code.
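A minimal Python sketch of splitting a name field; the "first middle last" layout is an assumption, and real name parsing needs more care (suffixes, multi-word surnames, and so on):

```python
def split_name(full_name):
    # Split one legacy text field into separately indexable components.
    parts = full_name.split()
    return {"first_name": parts[0],
            "middle_initials": " ".join(parts[1:-1]),
            "last_name": parts[-1]}

print(split_name("John Q Public"))
# {'first_name': 'John', 'middle_initials': 'Q', 'last_name': 'Public'}
```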
Merging of Information
• This type of data transformation does not literally mean the merging of several
fields to create a single field of data.
• For example, information about a product may come from different data
sources. The product code and description may come from one data source.
The relevant package types may be found in another data source. The cost
data may be from yet another source.
• In this case, merging of information denotes the combination of the product
code, description, package types, and cost into a single entity.
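A minimal Python sketch of such a merge, with three hypothetical source extracts keyed by product code:

```python
# Each dictionary stands in for an extract from a different source system.
codes     = {"P100": {"description": "Steel bolt, 10 mm"}}
packaging = {"P100": {"package_types": ["box", "pallet"]}}
costs     = {"P100": {"unit_cost": 0.12}}

# Combine the pieces into a single entity per product code.
merged = {code: {**codes[code],
                 **packaging.get(code, {}),
                 **costs.get(code, {})}
          for code in codes}

print(merged["P100"])
# {'description': 'Steel bolt, 10 mm', 'package_types': ['box', 'pallet'],
#  'unit_cost': 0.12}
```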
Character Set Conversion
• This type of data transformation relates to the conversion of character sets to
an agreed standard character set for textual data in the data warehouse.
• If you have mainframe legacy systems as source systems, the source data from
these systems will be in EBCDIC characters. If PC-based architecture is the
choice for your data warehouse, then you must convert the mainframe EBCDIC
format to the ASCII format.
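A minimal Python sketch; the standard library ships EBCDIC codecs (cp500 is one common variant, and the right codec depends on the mainframe's code page):

```python
# Bytes as a mainframe extract might hold them, in EBCDIC (code page 500).
ebcdic_bytes = "HELLO".encode("cp500")
print(ebcdic_bytes)                  # b'\xc8\xc5\xd3\xd3\xd6'

# Decode from EBCDIC, then re-encode in the warehouse's standard character set.
ascii_bytes = ebcdic_bytes.decode("cp500").encode("ascii")
print(ascii_bytes)                   # b'HELLO'
```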
Conversion of Units of Measurements
• Many companies today have global branches.
• Measurements in many European countries are in metric units. If your company has overseas operations, you may have to convert the measurements so that the numbers are all in one standard unit of measurement.
Date/Time Conversion
• This type relates to representation of date and time in standard formats.
Summarization
• This type of transformation is the creation of summaries to be loaded into the data warehouse instead of the most granular level of data.
• For example, for a credit card company to analyze sales patterns, it may not
be necessary to store in the data warehouse every single transaction on each
credit card. Instead, you may want to summarize the daily transactions for
each credit card and store the summary data.
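A minimal Python sketch that rolls hypothetical card transactions up to one summary row per card per day:

```python
from collections import defaultdict

transactions = [
    ("card-1", "2024-01-05", 25.00),
    ("card-1", "2024-01-05", 12.50),
    ("card-2", "2024-01-05", 99.99),
]

# Aggregate to the (card, day) grain instead of loading every transaction.
daily = defaultdict(lambda: {"count": 0, "total": 0.0})
for card, day, amount in transactions:
    daily[(card, day)]["count"] += 1
    daily[(card, day)]["total"] += amount

for (card, day), summary in daily.items():
    print(card, day, summary)
# card-1 2024-01-05 {'count': 2, 'total': 37.5}
# card-2 2024-01-05 {'count': 1, 'total': 99.99}
```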
Key Restructuring
• While extracting data from your input sources, look at the primary keys of
the extracted records.
• You will have to come up with keys for the fact and dimension tables based
on the keys in the extracted records.
• For example, suppose the product code in an organization is structured to have inherent meaning.
• If you use this product code as the primary key, there would be problems.
• If the product is moved to another warehouse, the warehouse part of the
product key will have to be changed.
• When choosing keys for your data warehouse database tables, avoid such
keys with built-in meanings.
• Transform such keys into generic keys generated by the system itself. This is
called key restructuring.
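A minimal Python sketch of key restructuring; a system-generated surrogate key replaces the meaningful product code, which is kept only as an ordinary attribute (the code layout is hypothetical):

```python
import itertools

surrogate_seq = itertools.count(start=1)
key_map = {}  # source product code -> system-generated surrogate key

def surrogate_key(source_product_code):
    # Issue a new generic key the first time a source key is seen;
    # return the same key on every later load.
    if source_product_code not in key_map:
        key_map[source_product_code] = next(surrogate_seq)
    return key_map[source_product_code]

print(surrogate_key("WH1-CAT7-00042"))  # 1 (first time seen)
print(surrogate_key("WH1-CAT7-00042"))  # 1 (stable on later loads)
# If the product later moves to warehouse WH2, only its warehouse
# attribute changes; surrogate key 1 stays the same.
```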
Deduplication
• In many companies, the customer files have several records for the same
customer. Mostly, the duplicates are the result of creating additional records
by mistake.
• In your data warehouse, you need to keep a single record for one customer
and link all the duplicates in the source systems to this single record.
• This process is called deduplication of the customer file.
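A minimal Python sketch; matching on a normalized name is an assumption, since real matching usually combines several attributes (name, address, date of birth, and so on):

```python
source_records = [
    {"src_id": "A-17", "name": "Jane  Doe "},
    {"src_id": "B-02", "name": "jane doe"},
    {"src_id": "C-44", "name": "John Smith"},
]

def match_key(record):
    # Normalize case and whitespace so duplicates compare equal.
    return " ".join(record["name"].lower().split())

survivors = {}  # match key -> surviving src_id
xref = {}       # every source record -> the single surviving record
for rec in source_records:
    key = match_key(rec)
    survivors.setdefault(key, rec["src_id"])
    xref[rec["src_id"]] = survivors[key]

print(xref)  # {'A-17': 'A-17', 'B-02': 'A-17', 'C-44': 'C-44'}
```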
DATA LOADING
• The next major set of functions consists of the ones that take the prepared
data, apply it to the data warehouse, and store it in the database there.
• Initial Load—populating all the data warehouse tables for the very first time
• Incremental Load—applying ongoing changes as necessary in a periodic
manner
• Full Refresh—completely erasing the contents of one or more tables and
reloading with fresh data (initial load is a refresh of all the tables).
• During the loads, the data warehouse has to be offline.
• You need to find a window of time when the loads may be scheduled without
affecting your data warehouse users.
• Therefore, consider dividing up the whole load process into smaller chunks
and populating a few files at a time. This will give you two benefits.
• You may be able to run the smaller loads in parallel. Also, you might be
able to keep some parts of the data warehouse up and running while
loading the other parts.
• It is hard to estimate the running times of the loads, especially the initial
load or a complete refresh. Do test loads to verify the correctness and to
estimate the running times.
Modes for applying data
1. Load
2. Append
3. Destructive merge
4. Constructive merge
Load
• If the target table to be loaded already exists and data exists in the
table, the load process wipes out the existing data and applies the
data from the incoming file.
• If the table is already empty before loading, the load process simply
applies the data from the incoming file.
Append
• You may think of the append as an extension of the load.
• If data already exists in the table, the append process unconditionally
adds the incoming data, preserving the existing data in the target
table.
• When an incoming record is a duplicate of an already existing record,
you may define how to handle an incoming duplicate.
• The incoming record may be allowed to be added as a duplicate, or
the incoming duplicate record may be rejected during the append
process.
Destructive Merge
• In this mode, you apply the incoming data to the target data. If the
primary key of an incoming record matches with the key of an existing
record, update the matching target record.
• If the incoming record is a new record without a match with any
existing record, add the incoming record to the target table.
Constructive Merge
• This mode is slightly different from the destructive merge.
• If the primary key of an incoming record matches with the key of an
existing record, leave the existing record, add the incoming record, and
mark the added record as superseding the old record.
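A minimal Python sketch contrasting the two merge modes on in-memory records; the record layout is hypothetical:

```python
def destructive_merge(target, incoming):
    for rec in incoming:
        # Matching key: overwrite the existing record; new key: insert.
        target[rec["customer_id"]] = rec

def constructive_merge(target, incoming):
    for rec in incoming:
        history = target.setdefault(rec["customer_id"], [])
        for old in history:
            old["current"] = False            # keep, but mark as superseded
        history.append({**rec, "current": True})

t1 = {}
destructive_merge(t1, [{"customer_id": 7, "state": "NY"}])
destructive_merge(t1, [{"customer_id": 7, "state": "CA"}])
print(t1)  # only the CA record survives

t2 = {}
constructive_merge(t2, [{"customer_id": 7, "state": "NY"}])
constructive_merge(t2, [{"customer_id": 7, "state": "CA"}])
print(t2)  # both records kept; the CA record is marked current
```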
Modes of applying data: Summary
ETL versus ELT
(Source: https://www.xplenty.com/blog/etl-vs-elt/)
• "Data Lakes" are special kinds of data stores that—unlike OLAP data
warehouses—accept any kind of structured or unstructured data.
• Data lakes don't require you to transform your data before loading it. You can
immediately load any kind of raw information into a data lake, no matter the
format or lack thereof.
• Data transformation is still necessary before analyzing the data with a business intelligence platform. However, data cleansing, enrichment, and transformation occur after loading the data into the data lake.
• ELT is a relatively new technology, made possible because of modern, cloud-based
server technologies.
• Cloud-based data warehouses, such as Amazon Redshift and Google BigQuery, offer near-endless storage capacity and scalable processing power.
• ELT paired with a data lake lets you ingest an ever-expanding pool of raw data
immediately, as it becomes available. There's no requirement to transform the
data into a special format before saving it in the data lake.
• ELT transforms only the data required for a particular analysis.
• Although this can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts and reports.
• ELT is less reliable than ETL.
• It’s important to note that the tools and systems of ELT are still evolving, so
they're not as reliable as ETL paired with an OLAP database.
• Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data.