ETL stands for extract, transform, and load; it is the traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake.
ETL is a process that extracts data from multiple sources, transforms it to fit operational needs, and loads it into a data warehouse or other destination system. It migrates, converts, and transforms data to make it accessible for business analysis. The ETL process extracts raw data, transforms it by cleaning, consolidating, and formatting the data, and loads the transformed data into the target data warehouse or data marts.
The ETL process in data warehousing involves extraction, transformation, and loading of data. Data is extracted from operational databases, transformed to match the data warehouse schema, and loaded into the data warehouse database. As source data and business needs change, the ETL process must also evolve to maintain the data warehouse's value as a business decision making tool. The ETL process consists of extracting data from sources, transforming it to resolve conflicts and quality issues, and loading it into the target data warehouse structures.
The document discusses ETL (extract, transform, load) which is a process used to clean and prepare data from various sources for analysis in a data warehouse. It describes how ETL extracts data from different source systems, transforms it into a uniform format, and loads it into a data warehouse. It also provides examples of ETL tools, the purpose of ETL testing including testing for data accuracy and integrity, and SQL queries commonly used for ETL testing.
The document discusses data integration and the ETL process. It provides details on:
1. Data integration, which combines data from different sources to create a unified view, supporting business analysis. It involves extracting, transforming, and loading data.
2. The general approach of integration, which can be achieved through application, business process, and user interaction integration. Techniques include ETL, data federation, and data propagation.
3. Data integration for data warehousing, focusing on the "reconciled data layer" which harmonizes data from sources before loading into the warehouse. This involves transforming operational data characteristics.
ETL is a process that involves extracting data from multiple sources, transforming it to fit operational needs, and loading it into a data warehouse. It provides a method of moving data from various source systems into a data warehouse to enable complex business analysis. The ETL process consists of extraction, which gathers and cleanses raw data from source systems; transformation, which prepares the data for the data warehouse through steps like validation and standardization; and loading, which stores the transformed data in the data warehouse. ETL tools automate and simplify the ETL process and provide advantages like faster development, metadata management, and performance optimization.
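To make the three phases concrete, here is a minimal, self-contained sketch of an ETL pipeline in Python using SQLite; the table names, columns, and transformation rules are illustrative assumptions, not taken from any particular tool discussed here:

```python
import sqlite3

# Minimal ETL sketch: extract from a source table, transform rows in Python,
# and load them into a warehouse table. All names are illustrative.

def extract(source_conn):
    # Extract: pull raw rows from the operational (OLTP) source.
    cur = source_conn.execute("SELECT customer_id, name, country, amount FROM sales_raw")
    return cur.fetchall()

def transform(rows):
    # Transform: cleanse and standardize before loading.
    cleaned = []
    for customer_id, name, country, amount in rows:
        if customer_id is None:                    # reject rows missing the business key
            continue
        cleaned.append((
            customer_id,
            (name or "").strip().title(),          # normalize name casing
            (country or "UNKNOWN").upper(),        # default + standardize codes
            round(float(amount or 0), 2),          # map NULL amounts to 0
        ))
    return cleaned

def load(dw_conn, rows):
    # Load: write the transformed rows into the warehouse table.
    dw_conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                       (customer_id INTEGER, name TEXT, country TEXT, amount REAL)""")
    dw_conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    dw_conn.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")   # stand-in for the operational source
    src.execute("CREATE TABLE sales_raw (customer_id INTEGER, name TEXT, country TEXT, amount REAL)")
    src.executemany("INSERT INTO sales_raw VALUES (?, ?, ?, ?)",
                    [(1, " alice ", "us", 10.5), (2, "Bob", None, None), (None, "x", "de", 3)])
    dw = sqlite3.connect(":memory:")    # stand-in for the target warehouse
    load(dw, transform(extract(src)))
    print(dw.execute("SELECT * FROM sales_fact").fetchall())
```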
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW) – Andreas Buckenhofer
Part 3(4)
The slides contain a DWH lecture given for students in 5th semester. Content:
- Introduction DWH and Business Intelligence
- DWH architecture
- DWH project phases
- Logical DWH Data Model
- Multidimensional data modeling
- Data import strategies / data integration / ETL
- Frontend: Reporting and analysis, information design
- OLAP
Learn the fundamentals of ETL (Extract, Transform, Load) and the innovative concept of Zero ETL in data integration. Explore how traditional ETL processes handle data extraction, transformation, and loading, and discover the streamlined approach of Zero ETL, minimising complexities and optimising data workflows.
Know more at: https://bit.ly/3U6eWxH
A Data Warehouse can be defined as a centralized, consistent data store or Decision Support System (OLAP) for end business users, supporting analysis, prediction, and decision making in their business operations. Data from various enterprise-wide application/transactional source systems (OLTP) is extracted, cleansed, integrated, transformed, and loaded into the Data Warehouse.
The document discusses the extraction, transformation, and loading (ETL) process used in data warehousing. It describes how ETL tools extract data from operational systems, transform the data through cleansing and formatting, and load it into the data warehouse. Metadata is generated during the ETL process to document the data flow and mappings. The roles of different types of metadata are also outlined. Common ETL tools and their strengths and limitations are reviewed.
What is ETL testing & how to enforce it in a Data Warehouse – BugRaptors
BugRaptors always stays up to date with the latest technologies and ongoing trends in testing. Technologies like ETL testing are bringing significant changes, widening the scope of testing by keeping all positive and negative scenarios in mind.
In the realm of data management, "data migration" and "ETL" (Extract, Transform, Load) are often used interchangeably, yet they represent distinct processes with specific use cases. Understanding the differences between these two concepts is crucial for businesses looking to optimize their data handling strategies. This article will elucidate the unique characteristics of data migration and ETL, and highlight how Ask On Data, a leading data migration tool, can facilitate these processes.
An Overview on Data Quality Issues at Data Staging ETL – idescitation
A data warehouse (DW) is a collection of technologies aimed at enabling the decision maker to make better and faster decisions. Data warehouses differ from operational databases in that they are subject oriented, integrated, time variant, non volatile, summarized, larger, not normalized, and perform OLAP. The generic data warehouse architecture consists of three layers (data sources, DSA, and primary data warehouse). During the ETL process, data is extracted from OLTP databases, transformed to match the data warehouse schema, and loaded into the data warehouse database.
This document provides an overview of ETL testing. It begins by explaining that an ETL tool extracts data from heterogeneous data sources, transforms the data, and loads it into a data warehouse. It then discusses the audience and prerequisites for ETL testing. Finally, it provides a copyright notice and table of contents for the document.
This document discusses testing of data warehouses. It describes how data warehouse testing is an important part of the design and ongoing maintenance of a data warehouse. The key components that require testing include the extract, transform, load (ETL) process, online analytical processing (OLAP) engine, and client applications. The document outlines different phases of data warehouse testing including ETL testing, data load testing, initial data load testing, user interface testing, and regression testing during ongoing data feeds. It emphasizes the importance of testing data quality throughout the data warehouse lifecycle.
The document discusses various concepts related to database design and data warehousing. It describes how DBMS minimize problems like data redundancy, isolation, and inconsistency through techniques like normalization, indexing, and using data dictionaries. It then discusses data warehousing concepts like the need for data warehouses, their key characteristics of being subject-oriented, integrated, and time-variant. Common data warehouse architectures and components like the ETL process, OLAP, and decision support systems are also summarized.
The document compares ETL and ELT data integration processes. ETL extracts data from sources, transforms it, and loads it into a data warehouse. ELT loads extracted data directly into the data warehouse and performs transformations there. Key differences include that ETL is better for structured data and compliance, while ELT handles any size/type of data and transformations are more flexible but can slow queries. AWS Glue, Azure Data Factory, and SAP BODS are tools that support these processes.
The document provides an overview of the data migration process. It discusses the key steps which include discovering the source and target systems, mapping data fields between the systems, extracting and transforming the data, loading it into a staging system, and then loading it into the target system. It also discusses verifying the data and common tools used for data migration projects.
Data Warehouse:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support.
Extraction:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible.
Data Transformation:
Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse.
Data Loading:
During the load step, it is necessary to ensure that the load is performed correctly and using as few resources as possible. The target of the load process is often a database. To make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
The document provides information about data warehousing fundamentals. It discusses key concepts such as data warehouse architectures, dimensional modeling, fact and dimension tables, and metadata. The three common data warehouse architectures described are the basic architecture, architecture with a staging area, and architecture with staging area and data marts. Dimensional modeling is optimized for data retrieval and uses facts, dimensions, and attributes. Metadata provides information about the data in the warehouse.
Data warehousing is a repository of an organization's electronically stored data designed for reporting and analysis. A data warehouse uses an extract, transform, load process to integrate data from multiple sources and organize it into a dimensional model to support business intelligence needs. It provides consistent, integrated views of data across an organization to help analyze patterns and trends.
What are the key points to focus on before starting to learn ETL Development.... – kzayra69
Before embarking on your journey into ETL (Extract, Transform, Load) Development, it's essential to focus on several key points to build a robust foundation. Firstly, grasp the fundamental principles of ETL, encompassing data extraction, transformation, and loading processes. Acquire knowledge about data warehousing concepts as ETL often serves as a pivotal component in data warehousing projects. Furthermore, develop a solid understanding of SQL and databases, including tables, indexes, joins, and SQL syntax. Proficiency in programming languages like Python, Java, or scripting languages is also beneficial, depending on the chosen ETL tool or if building custom solutions. Explore popular ETL tools such as Informatica, Talend, Pentaho, or Apache NiFi to understand their features and capabilities. Additionally, familiarize yourself with techniques for ensuring data quality throughout the ETL process, including data validation, error handling, and data profiling. Understanding common data integration patterns such as batch processing and real-time processing is also crucial. These key points collectively lay the groundwork for effective ETL design, implementation, and maintenance, setting you on the path to success in the dynamic field of ETL Development.
The document discusses tips for designing test data before executing test cases. It recommends creating fresh test data specific to each test case rather than relying on outdated standard data. It also suggests keeping personal copies of test data to avoid corruption when multiple testers access shared data. The document provides examples of how to prepare large data sets needed for performance testing.
- A data warehouse is a central repository for an organization's historical data that is used to support management reporting and decision making. It contains data from multiple sources integrated into a consistent structure.
- Data warehouses are optimized for querying and analysis rather than transactions. They use a dimensional model and denormalized structures to improve query performance for business users.
- There are two main approaches to data warehouse design - the dimensional model advocated by Kimball and the normalized model advocated by Inmon. Both have advantages and disadvantages for query performance and ease of use.
“Extract, Load, Transform” is another type of data integration process – RashidRiaz18
The document discusses the Extract, Transform, Load (ETL) process used for data integration and manipulation. It describes the key phases as extracting data from sources, transforming it by cleaning and structuring the data, and loading it into a target database. Specifically, the extract phase acquires raw data from various systems, the transform phase alters and reformats the data, and the load phase inserts the processed data into the target repository. The document also covers ETL tools, challenges involving data volume and performance, and solutions like parallel processing and distributed computing.
IRJET – Comparative Study of ETL and E-LT in Data Warehousing – IRJET Journal
This document compares the Extract, Transform, Load (ETL) approach and the Extract, Load, Transform (ELT) approach for loading data into a data warehouse. It discusses how ETL works by extracting data from various sources, transforming it using business rules, and loading it into the data warehouse. ELT instead extracts and loads the raw data first before transforming it. The document reviews past research on both approaches and discusses their advantages and disadvantages. It aims to evaluate the performance differences between ETL and ELT.
OVERVIEW
Extraction, Transformation, Loading – ETL
- To get data out of the source and load it into the data warehouse – simply a process of copying data from one database to another
- Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database
- Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
- When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation
- ETL is often a complex combination of process and technology that consumes a significant portion of data warehouse development effort and requires the skills of business analysts, database designers, and application developers
- It is not a one-time event, as new data is added to the data warehouse periodically – monthly, daily, hourly
- ETL is an integral, ongoing, and recurring part of the data warehouse, and should be:
  - Automated
  - Well documented
  - Easily changeable
STAGING AKA OPERATIONAL DATA STORE (ODS)
- ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database
- This creates a logical and physical separation between the source systems and the data warehouse
- It minimizes the impact of the intense periodic ETL activity on the source and data warehouse databases
EXTRACT
During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured, semi-structured, unstructured, or streaming. Those sources include:
- SQL or NoSQL servers
- CRM and ERP systems
- Flat files
- Email
- Web pages
EXTRACTION
- The ETL process needs to effectively integrate systems that have different DBMSs, hardware, operating systems, and communication protocols. Sources include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs, call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, amongst others.
- A logical data map is needed before the physical data can be transformed.
- The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, and is usually presented in a table or spreadsheet.
EXTRACTION
- The content of the logical data mapping document has proven to be the critical element required to efficiently plan ETL processes.
- The table type gives us our cue for the ordinal position of our data load processes – first dimensions, then facts.
- The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint (transformation rules and logic) of exactly what is expected from the ETL process. The table must depict, without question, the course of action involved in the transformation process.
- The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL, and the SQL may or may not be the complete statement. (A sketch of such a mapping follows below.)
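For illustration only, here is a minimal sketch of what two entries of a logical data map might look like if captured in Python rather than a spreadsheet; the system, table, and column names and the transformation expressions are hypothetical:

```python
# A minimal, hypothetical logical data map: one entry per target column.
# Real documents usually live in a spreadsheet; the fields below mirror the
# typical columns (target table/column, table type, source, transformation).
logical_data_map = [
    {
        "target_table": "customer_dim",            # dimensions load before facts
        "target_column": "customer_name",
        "table_type": "dimension",
        "source_system": "CRM",
        "source_table": "customers",
        "source_column": "cust_nm",
        "transformation": "UPPER(TRIM(cust_nm))",  # often expressed in SQL
    },
    {
        "target_table": "sales_fact",
        "target_column": "amount_usd",
        "table_type": "fact",
        "source_system": "ERP",
        "source_table": "orders",
        "source_column": "amount",
        "transformation": "amount * exchange_rate_to_usd",
    },
]

# The table type drives load order: dimensions first, then facts.
load_order = sorted(logical_data_map, key=lambda m: m["table_type"] != "dimension")
for entry in load_order:
    print(entry["target_table"], "<-", entry["source_table"] + "." + entry["source_column"])
```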
EXTRACTION
Three data extraction methods:
- Full Extraction
- Partial Extraction – without update notification
- Partial Extraction – with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems could be test, development, or production databases; any slowdown or locking could affect the company's bottom line. (A sketch of a partial extraction follows below.)
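As a rough illustration of partial extraction without update notification, the sketch below pulls only rows changed since a saved high-water-mark timestamp; the orders table, updated_at column, and watermark handling are assumptions made for the example:

```python
import sqlite3

# Sketch of "partial extraction without update notification": pull only rows
# changed since the last successful run, using a high-water-mark timestamp.
# Table and column names (orders, updated_at) are illustrative.

def extract_incremental(conn, last_extracted_at):
    # Only rows modified after the previous watermark are extracted.
    return conn.execute(
        "SELECT order_id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_extracted_at,),
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, updated_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 9.99, "2024-01-01T10:00:00"), (2, 5.00, "2024-02-01T09:30:00")])
    watermark = "2024-01-15T00:00:00"               # value saved by the previous run
    changed = extract_incremental(conn, watermark)
    print(changed)                                   # only order 2 is re-extracted
    if changed:
        watermark = max(row[2] for row in changed)   # advance the watermark for the next run
```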
EXTRACTION
Some validations are done during extraction (a sketch of such checks follows the list):
- Reconcile records with the source data
- Make sure that no spam/unwanted data is loaded
- Data type check
- Remove all types of duplicate/fragmented data
- Check whether all keys are in place or not
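A minimal sketch of a few of these extraction-time validations, assuming rows arrive as tuples whose first element is the business key and whose second element is a numeric amount (both assumptions made for illustration):

```python
# Sketch of extraction-time validations on rows pulled from a source system.
# expected_count would come from reconciling against the source system.

def validate_extract(rows, expected_count, key_index=0):
    issues = []
    if len(rows) != expected_count:                       # reconcile record counts
        issues.append(f"row count {len(rows)} != source count {expected_count}")
    seen = set()
    for row in rows:
        key = row[key_index]
        if key is None:                                   # keys must be in place
            issues.append(f"missing key in row {row}")
        elif key in seen:                                 # flag duplicate records
            issues.append(f"duplicate key {key}")
        seen.add(key)
        if not isinstance(row[1], (int, float)):          # simple data type check
            issues.append(f"bad amount type in row {row}")
    return issues

print(validate_extract([(1, 10.0), (1, 5.0), (2, "x")], expected_count=3))
```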
TRANSFORM
- In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case.
- This includes filtering, cleansing, de-duplicating, validating, and authenticating the data, as well as performing calculations, translations, or summarizations based on the raw data.
- It can also include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more; conducting audits to ensure data quality and compliance; removing, encrypting, or protecting data governed by industry or governmental regulators; and formatting the data into tables or joined tables to match the schema of the target data warehouse.
TRANSFORM AKA CLEANSE DATA
Anomaly detection:
- Data sampling – e.g., a count(*) of the rows for a department column
Column property enforcement (a sketch of such checks follows the list):
- Null values in required columns
- Numeric values that fall outside of expected highs and lows
- Columns whose lengths are exceptionally short/long
- Columns with certain values outside of discrete valid value sets
- Adherence to a required pattern / membership in a set of patterns
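A minimal sketch of column property enforcement, checking one row at a time against a small rule table; the columns, ranges, valid sets, and pattern are invented for the example:

```python
import re

# Sketch of column property enforcement during cleansing.
RULES = {
    "employee_id": {"required": True, "pattern": re.compile(r"^E\d{4}$")},
    "age":         {"required": True, "min": 16, "max": 99},
    "department":  {"required": True, "valid_set": {"HR", "IT", "SALES"}},
    "nickname":    {"required": False, "max_len": 20},
}

def check_row(row):
    problems = []
    for col, rule in RULES.items():
        value = row.get(col)
        if value is None or value == "":
            if rule.get("required"):
                problems.append(f"{col}: null in required column")
            continue
        if "min" in rule and not (rule["min"] <= value <= rule["max"]):
            problems.append(f"{col}: {value} outside expected high/low range")
        if "max_len" in rule and len(value) > rule["max_len"]:
            problems.append(f"{col}: exceptionally long value")
        if "valid_set" in rule and value not in rule["valid_set"]:
            problems.append(f"{col}: {value} not in the valid value set")
        if "pattern" in rule and not rule["pattern"].match(value):
            problems.append(f"{col}: does not match the required pattern")
    return problems

print(check_row({"employee_id": "X123", "age": 150, "department": "OPS"}))
```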
TRANSFORM - CONFIRMING
Structure enforcement (a referential-integrity sketch follows the list):
- Tables have proper primary and foreign keys
- Obey referential integrity
Data and rule value enforcement:
- Simple business rules
- Logical data checks
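One way to enforce referential integrity in the staging area is an anti-join that finds fact rows whose foreign key has no matching dimension row; the tables and columns below are illustrative, not from the slides:

```python
import sqlite3

# Structure-enforcement sketch: detect fact rows that violate referential integrity.
ORPHAN_CHECK = """
    SELECT f.sale_id, f.customer_id
    FROM sales_fact f
    LEFT JOIN customer_dim d ON d.customer_id = f.customer_id
    WHERE d.customer_id IS NULL
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, customer_id INTEGER, amount REAL)")
    conn.execute("INSERT INTO customer_dim VALUES (1, 'Alice')")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [(10, 1, 5.0), (11, 99, 7.5)])
    orphans = conn.execute(ORPHAN_CHECK).fetchall()
    print("rows violating referential integrity:", orphans)   # sale 11 -> customer 99
```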
TRANSFORM - CONFIRMING
Data integrity problems (a standardization sketch follows the list):
- Different spellings of the same person, such as Jon, John, etc.
- Multiple ways to denote a company name, like Google, Google Inc.
- Use of different names, like Cleveland, cleveland.
- Different account numbers may be generated by various applications for the same customer.
- In some data, required fields remain blank.
- Invalid products collected at POS, as manual entry can lead to mistakes.
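A minimal sketch of standardizing the kinds of inconsistent values listed above, using simple lookup tables; the mappings and field names are illustrative rather than a complete solution:

```python
# Standardize inconsistent spellings and company names with lookup tables.
COMPANY_ALIASES = {"google": "Google Inc.", "google inc.": "Google Inc."}
NAME_ALIASES = {"jon": "John"}

def standardize(record):
    name = record["name"].strip().title()                    # cleveland -> Cleveland
    name = NAME_ALIASES.get(name.lower(), name)              # Jon -> John
    company = record["company"].strip()
    company = COMPANY_ALIASES.get(company.lower(), company)  # Google -> Google Inc.
    return {**record, "name": name, "company": company}

print(standardize({"name": "jon", "company": "Google", "account_no": "A-17"}))
```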
TRANSFORM - CONFIRMING
Validation during the stage (a conversion and threshold sketch follows the list):
- Filtering – select only certain columns to load
- Using rules and lookup tables for data standardization
- Character set conversion and encoding handling
- Conversion of units of measurement, such as date/time conversion, currency conversions, numerical conversions, etc.
- Data threshold validation check; for example, age cannot be more than two digits for an employee
- Data flow validation from the staging area to the intermediate tables
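A minimal sketch of two of the staging validations above: converting units (dates and currency) and a data threshold check on age. The conversion rates, date format, and limits are assumptions made for illustration:

```python
from datetime import datetime

USD_RATE = {"EUR": 1.08, "USD": 1.0}   # hypothetical fixed conversion rates

def stage_row(row):
    # Date/time conversion: source format -> ISO 8601
    order_date = datetime.strptime(row["order_date"], "%d/%m/%Y").date().isoformat()
    # Currency conversion to a common reporting currency
    amount_usd = round(row["amount"] * USD_RATE[row["currency"]], 2)
    # Threshold validation: age cannot be more than two digits
    if not (0 <= row["age"] <= 99):
        raise ValueError(f"age {row['age']} fails threshold validation")
    return {"order_date": order_date, "amount_usd": amount_usd, "age": row["age"]}

print(stage_row({"order_date": "31/12/2024", "amount": 100.0, "currency": "EUR", "age": 42}))
```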
TRANSFORM - CONFIRMING
Validation during the stage (continued; a cleaning sketch follows the list):
- Required fields should not be left blank
- Cleaning (for example, mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F", etc.)
- Splitting a column into multiple columns and merging multiple columns into a single column
- Transposing rows and columns
- Using lookups to merge data
- Using any complex data validation (e.g., if the first two columns in a row are empty, automatically reject the row from processing)
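A minimal sketch of a few of these cleaning rules: mapping NULL to 0, coding gender values, splitting one column into two, and rejecting rows whose first two columns are empty. Column positions and codes are illustrative assumptions:

```python
GENDER_CODES = {"Male": "M", "Female": "F"}

def clean_row(row):
    # Complex validation: reject the row if the first two columns are empty.
    if not row[0] and not row[1]:
        return None
    cust_id, full_name, gender, amount = row
    first, _, last = (full_name or "").partition(" ")       # split one column into two
    return {
        "cust_id": cust_id,
        "first_name": first,
        "last_name": last,
        "gender": GENDER_CODES.get(gender, gender),          # Male -> M, Female -> F
        "amount": amount if amount is not None else 0,       # map NULL to 0
    }

rows = [(1, "Ada Lovelace", "Female", None), ("", "", "Male", 5.0)]
print([clean_row(r) for r in rows])   # the second row is rejected (None)
```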
LOAD
- In this last step, the transformed data is moved from the staging area into a target data warehouse.
- Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes and, less often, full refreshes to erase and replace data in the warehouse.
- Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (nights); hence, the load process should be optimized for performance.
- In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse admins need to monitor, resume, or cancel loads as per prevailing server performance.
LOAD
Types of loading (a sketch of all three follows the list):
- Initial Load – populating all the data warehouse tables
- Incremental Load – applying ongoing changes as and when needed, periodically
- Full Refresh – erasing the contents of one or more tables and reloading them with fresh data
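A minimal sketch of the three load types against a single table in SQLite; the table and column names are illustrative, and the UPSERT syntax used for the incremental load requires SQLite 3.24 or later:

```python
import sqlite3

def initial_load(conn, rows):
    # Initial load: populate the table for the first time.
    conn.execute("CREATE TABLE IF NOT EXISTS customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?)", rows)

def incremental_load(conn, changed_rows):
    # Incremental load: insert new keys, update existing ones (UPSERT).
    conn.executemany(
        """INSERT INTO customer_dim (customer_id, name) VALUES (?, ?)
           ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name""",
        changed_rows,
    )

def full_refresh(conn, rows):
    # Full refresh: erase the table contents and reload with fresh data.
    conn.execute("DELETE FROM customer_dim")
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
initial_load(conn, [(1, "Alice"), (2, "Bob")])
incremental_load(conn, [(2, "Robert"), (3, "Carol")])
print(conn.execute("SELECT * FROM customer_dim ORDER BY customer_id").fetchall())
```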
LOAD
Load verification (a sketch of such checks follows the list):
- Ensure that the key field data is neither missing nor null.
- Test modeling views based on the target tables.
- Check combined values and calculated measures.
- Data checks in the dimension table as well as the history table.
- Check the BI reports on the loaded fact and dimension tables.
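A minimal sketch of load verification: key fields must not be null, and combined values (row counts and summed measures) should reconcile between the staging table and the loaded fact table. The table names are illustrative assumptions:

```python
import sqlite3

def verify_load(conn):
    null_keys = conn.execute(
        "SELECT COUNT(*) FROM sales_fact WHERE customer_id IS NULL"
    ).fetchone()[0]
    staged = conn.execute("SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM sales_stage").fetchone()
    loaded = conn.execute("SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM sales_fact").fetchone()
    return {
        "null_key_rows": null_keys,                          # key fields must not be null
        "counts_match": staged[0] == loaded[0],              # reconcile row counts
        "sums_match": abs(staged[1] - loaded[1]) < 0.01,     # reconcile summed measures
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_stage (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales_fact  (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales_stage VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",  [(1, 10.0), (None, 20.0)])
print(verify_load(conn))   # flags the null key; counts and sums still reconcile
```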
BEST PRACTICES FOR THE ETL PROCESS
- Never try to cleanse all the data: every organization would like to have all of its data clean, but most are not ready to pay for it or to wait. Cleansing everything would simply take too long, so it is better not to try to cleanse all the data; cleanse only the relevant data.
- Never cleanse nothing: always plan to clean something, because the biggest reason for building the data warehouse is to offer cleaner and more reliable data.
- Determine the cost of cleansing the data: before cleansing all the dirty data, it is important to determine the cleansing cost for every dirty data element; determine the cost per data element.
SUMMARY
- ETL is an abbreviation of Extract, Transform and Load.
- ETL provides a method of moving data from various sources into a data warehouse.
- In the first step, extraction, data is extracted from the source system into the staging area.
- In the transformation step, the data extracted from the source is cleansed and transformed.
- Loading data into the target data warehouse is the last step of the ETL process.
REFERENCES
- ETL Process in Data Warehouse, Chirayu Poundarik.
- ETL Methodology – https://www.ibm.com/in-en/cloud/learn/etl