ETL stands for extract, transform, and load and is a traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake.
OVERVIEW
Extraction Transformation Loading – ETL
Simply put, ETL is the process of getting data out of the source systems and loading it into the data warehouse – copying data from one database to another
Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database
Many data warehouses also incorporate data from non-OLTP systems
such as text files, legacy systems, and spreadsheets; such data also
requires extraction, transformation and loading
When defining ETL for a data warehouse, it is important to think of ETL as
a process, not a physical implementation
OVERVIEW
ETL is often a complex combination of process and technology that
consumes a significant portion of data warehouse development efforts
and requires the skills of business analysts, database designers and
application developers
It is not a one-time event, as new data is added to the data warehouse periodically – monthly, daily, or hourly
ETL is an integral, ongoing, and recurring part of a data warehouse, so it should be:
Automated
Well documented
Easily changeable
STAGING, AKA OPERATIONAL DATA STORE (ODS)
ETL operations should be performed on a relational database server
separate from the source databases and the data warehouse database
Creates a logical and physical separation between the source systems
and the data warehouse
Minimizes the impact of the intense periodic ETL activity on source and
data warehouse databases
EXTRACT
During data extraction, raw data is copied or exported from source locations to a staging area.
Data management teams can extract data from a variety of data sources, which can be structured, semi-structured, unstructured, or streaming.
Those sources include:
SQL or NoSQL servers
CRM and ERP systems
Flat files
Email
Web pages
EXTRACTION
The ETL process needs to effectively integrate systems that have different DBMSs, hardware, operating systems, and communication protocols. Sources include legacy applications such as mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others.
Need to have a logical data map before the physical data can be transformed
The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system; it is usually presented in a table or spreadsheet
EXTRACTION
The content of the logical data mapping document has proven to be the critical element required to efficiently plan ETL processes
The target table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts
The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint (transformation rules and logic) of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process
The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL; the SQL may or may not be the complete statement
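As a rough illustration, a couple of rows of such a logical data map might look like the layout below. The table, column, and rule names are hypothetical, not taken from the deck.

Target Table  | Target Column  | Table Type | Source Table | Source Column | Transformation
dim_customer  | customer_name  | Dimension  | crm.customer | cust_nm       | UPPER(TRIM(cust_nm))
fact_sales    | sales_amount   | Fact       | pos.orders   | amt           | amt * exchange_rate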
EXTRACTION
Three Data Extraction methods:
Full Extraction
Partial Extraction – without update notification
Partial Extraction – with update notification
Irrespective of the method used, extraction should not affect the performance or response time of the source systems, which could be test, development, or production databases. Any slowdown or locking could affect the company's bottom line. (A sketch of incremental partial extraction follows.)
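A minimal sketch of "partial extraction without update notification", pulling only rows changed since the last successful run by means of a last_updated watermark column. The orders table, column names, and SQLite stand-in source are illustrative assumptions, not part of the deck.

import sqlite3

source = sqlite3.connect(":memory:")   # stand-in for the OLTP source system
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 10.0, "2024-01-01"), (2, 25.0, "2024-02-01")])

def extract_incremental(conn, last_watermark):
    # Copy only rows modified after the previous run, keeping the
    # extraction load on the source system small.
    rows = conn.execute(
        "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_watermark,),
    ).fetchall()
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark

rows, wm = extract_incremental(source, "2024-01-15")   # returns only order 2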
EXTRACTION
Some validations are done during extraction (sketched below):
Reconcile records with the source data
Make sure that no spam/unwanted data is loaded
Data type checks
Remove all types of duplicate/fragmented data
Check whether all keys are in place or not
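A sketch of the extraction-time validations listed above: record reconciliation against the source count, duplicate removal, and key and data type checks. The field names are assumptions for illustration only.

def validate_extract(source_count, extracted_rows, key_field="customer_id"):
    problems = []
    # Reconcile record counts with the source data
    if len(extracted_rows) != source_count:
        problems.append(f"count mismatch: source={source_count}, extracted={len(extracted_rows)}")
    seen, deduped = set(), []
    for row in extracted_rows:
        key = row.get(key_field)
        # Check whether all keys are in place
        if key is None:
            problems.append(f"missing key in row: {row}")
            continue
        # Basic data type check on the key
        if not isinstance(key, int):
            problems.append(f"unexpected key type: {key!r}")
            continue
        # Remove duplicates on the business key
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    return deduped, problems

rows = [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": None}]
clean, issues = validate_extract(source_count=3, extracted_rows=rows)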
TRANSFORM
In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case:
Filtering, cleansing, de-duplicating, validating, and authenticating the data
Performing calculations, translations, or summarizations based on the raw data
TRANSFORM
This can include:
Changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more
Conducting audits to ensure data quality and compliance
Removing, encrypting, or protecting data governed by industry or governmental regulators
Formatting the data into tables or joined tables to match the schema of the target data warehouse
TRANSFORM AKA CLEANSE DATA
Anomaly Detection
Data sampling – count(*) of the rows for a department column
Column Property Enforcement
Null Values in required columns
Numeric values that fall outside of expected high and lows
Columns whose lengths are exceptionally short/long
Columns with certain values outside of discrete valid value sets
Adherence to a required pattern, or membership in a set of valid patterns
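A sketch of the column property checks above: nulls in required columns, numeric values outside expected highs and lows, suspicious string lengths, membership in a discrete valid value set, and a required pattern. Column names, thresholds, and the zip-code pattern are illustrative assumptions.

import re

def check_column_properties(rows):
    errors = []
    valid_departments = {"SALES", "HR", "IT"}          # discrete valid value set
    zip_pattern = re.compile(r"^\d{5}$")                # required pattern
    for i, row in enumerate(rows):
        if row.get("department") is None:                # null in a required column
            errors.append((i, "department is NULL"))
        elif row["department"] not in valid_departments: # value outside the valid set
            errors.append((i, f"unknown department {row['department']}"))
        age = row.get("age")
        if age is not None and not (18 <= age <= 99):    # outside expected high/low
            errors.append((i, f"age out of range: {age}"))
        name = row.get("name", "")
        if not (1 <= len(name) <= 100):                  # exceptionally short/long column
            errors.append((i, f"suspicious name length: {len(name)}"))
        if not zip_pattern.match(row.get("zip", "")):
            errors.append((i, f"bad zip: {row.get('zip')}"))
    return errors

check_column_properties([{"department": "IT", "age": 34, "name": "Ada", "zip": "44101"}])  # -> []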
TRANSFORM - CONFIRMING
Structure Enforcement
Tables have proper primary and foreign keys
Obey referential integrity
Data and Rule value enforcement
Simple business rules
Logical data checks
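A small sketch of structure enforcement: verifying that every foreign key in a fact table points at an existing dimension key before loading, so referential integrity is obeyed. Table and column names are assumptions for illustration.

def check_referential_integrity(fact_rows, dim_keys):
    # Rows whose customer_key is missing from the dimension violate
    # referential integrity and should be rejected or routed to an error table.
    return [row for row in fact_rows if row["customer_key"] not in dim_keys]

dim_customer_keys = {1, 2, 3}
fact_sales = [{"customer_key": 1, "amount": 10.0}, {"customer_key": 9, "amount": 5.0}]
orphans = check_referential_integrity(fact_sales, dim_customer_keys)   # -> the row with customer_key 9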
TRANSFORM - CONFIRMING
Data Integrity Problems
Different spellings of the same person's name, such as Jon, John, etc.
Multiple ways to denote a company name, such as Google and Google Inc.
Use of different casings of the same name, such as Cleveland and cleveland.
Different account numbers may be generated by various applications for the same customer.
Required fields left blank in some records.
Invalid product data collected at the POS, as manual entry can lead to mistakes.
TRANSFORM - CONFIRMING
Validation During the Stage
Filtering – select only certain columns to load
Using rules and lookup tables for data standardization
Character set conversion and encoding handling
Conversion of units of measurement, such as date/time conversion, currency conversions, numerical conversions, etc. (see the sketch after this list)
Data threshold validation check. For example, age cannot be more
than two digits for an employee.
Data flow validation from the staging area to the intermediate tables.
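A sketch of the staging validations above: date/time normalization, currency conversion via a lookup table, and the threshold check that an employee's age cannot exceed two digits. The exchange rates, date format, and field names are illustrative assumptions.

from datetime import datetime

EXCHANGE_RATES = {"EUR": 1.08, "USD": 1.0}     # lookup table used for standardization

def standardize(row):
    # Date/time conversion to a single ISO format
    row["hire_date"] = datetime.strptime(row["hire_date"], "%d/%m/%Y").date().isoformat()
    # Currency conversion to the warehouse's base currency
    row["salary_usd"] = round(row["salary"] * EXCHANGE_RATES[row["currency"]], 2)
    # Data threshold validation: age cannot be more than two digits
    if not 0 < row["age"] < 100:
        raise ValueError(f"age fails threshold check: {row['age']}")
    return row

standardize({"hire_date": "31/01/2020", "salary": 50000, "currency": "EUR", "age": 34})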
TRANSFORM - CONFIRMING
Validation During the Stage
Required fields should not be left blank.
Cleaning (for example, mapping NULL to 0, or Gender values Male to "M" and Female to "F")
Splitting a column into multiple columns and merging multiple columns into a single column
Transposing rows and columns
Using lookups to merge data
Applying any complex data validation (e.g., if the first two columns in a row are empty, the row is automatically rejected from processing)
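A sketch of the cleaning rules listed above: mapping NULL to 0, coding gender as "M"/"F", splitting one column into two, and rejecting rows whose first two columns are empty. The field names are hypothetical, chosen only for illustration.

GENDER_CODES = {"Male": "M", "Female": "F"}

def clean_row(row):
    # Complex validation: reject the row if its first two fields are empty
    if all(v in (None, "") for v in list(row.values())[:2]):
        return None
    # Mapping NULL to 0
    row["discount"] = row.get("discount") or 0
    # Standardize gender values to "M"/"F"
    row["gender"] = GENDER_CODES.get(row.get("gender"), row.get("gender"))
    # Split one column into multiple columns
    first, _, last = (row.pop("full_name", "") or "").partition(" ")
    row["first_name"], row["last_name"] = first, last
    return row

clean_row({"full_name": "Ada Lovelace", "gender": "Female", "discount": None})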
LOAD
In this last step, the transformed data is
moved from the staging area into a target
data warehouse.
Typically, this involves an initial loading of
all data, followed by periodic loading of
incremental data changes and, less often,
full refreshes to erase and replace data in
the warehouse.
LOAD
Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short window (typically overnight). Hence, the load process should be optimized for performance
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse admins need to monitor, resume, or cancel loads according to prevailing server performance
LOAD
Types of loading (each sketched below):
Initial Load – populating all the data warehouse tables
Incremental Load – applying ongoing changes periodically, as and when needed
Full Refresh – erasing the contents of one or more tables and reloading them with fresh data
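A sketch of the three load types against a target table, using an in-memory SQLite database as a stand-in warehouse. The table, column names, and upsert strategy are illustrative assumptions, not the deck's own schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")

def initial_load(rows):
    # Initial Load: populate the table from scratch
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

def incremental_load(rows):
    # Incremental Load: apply ongoing changes, inserting new keys and updating existing ones
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?) "
        "ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name",
        rows,
    )

def full_refresh(rows):
    # Full Refresh: erase the contents and reload with fresh data
    conn.execute("DELETE FROM dim_customer")
    initial_load(rows)

initial_load([(1, "Acme"), (2, "Globex")])
incremental_load([(2, "Globex Inc"), (3, "Initech")])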
LOAD
Load verification
Ensure that the key field data is neither missing nor null. Test modeling views based on the target tables.
Check that combined values and calculated measures are correct.
Data checks in the dimension tables as well as the history tables.
Check the BI reports on the loaded fact and dimension tables.
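A sketch of two of the load verification checks above: key fields that are neither missing nor null, and row counts reconciled between staging and the target fact table. The connection and table names (stg_sales, fact_sales) are illustrative assumptions.

def verify_load(conn):
    issues = []
    # Key field data should be neither missing nor null
    null_keys = conn.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE customer_key IS NULL"
    ).fetchone()[0]
    if null_keys:
        issues.append(f"{null_keys} fact rows have a NULL customer_key")
    # Reconcile loaded row counts against the staging table
    staged = conn.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
    loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    if staged != loaded:
        issues.append(f"row count mismatch: staged={staged}, loaded={loaded}")
    return issues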
BEST PRACTICES – ETL PROCESS
Never try to cleanse all the data
Every organization would like to have all of its data clean, but most are not ready to pay for it or to wait for it. Cleaning it all would simply take too long, so it is better not to try to cleanse all the data. Cleanse only the relevant data.
Never cleanse nothing
Always plan to clean something, because the biggest reason for building the data warehouse is to offer cleaner and more reliable data.
BEST PRACTICES – ETL PROCESS
Determine the cost of cleansing the data
Before cleansing all the dirty data, it is important for you to
determine the cleansing cost for every dirty data element.
Determine the cost per data element.
SUMMARY
ETL is an abbreviation of Extract, Transform and Load.
ETL provides a method of moving the data from various sources into a
data warehouse.
In the first step, extraction, data is extracted from the source systems into the staging area.
In the transformation step, the data extracted from the sources is cleansed and transformed.
Loading data into the target data warehouse is the last step of the ETL process.
REFERENCE
ETL Process in Data Warehouse by Chirayu Poundarik.
ETL Methodology - https://www.ibm.com/in-en/cloud/learn/etl.