Data Warehouse:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support.
The reconciled data is produced by the ETL process; its three steps (extraction, transformation, and loading) are described in the sections below.
2. Need to Know:
• What is a data warehouse?
• Why are they necessary?
• How are they constructed?
3. What Is a Data Warehouse?
• A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
– Time-variant: trends and changes can be studied
– Non-updatable: read-only, periodically refreshed
4. Why Are They Necessary?
• A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them available for decision support.
• It is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.
• Moving data from operational systems (OLTP) to informational systems (OLAP) requires the reconciliation of data!
8. The ETL Process
• Extract
• Transform
• Load
Data is:
– extracted from an OLTP database (relational),
– transformed to match the data warehouse schema,
– loaded into the data warehouse.
9. Purpose of the ETL Process:
Typical operational data is:
– Transient – not historical
– Not normalized (perhaps due to de-normalization for performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
After ETL, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalized – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Timely – current enough to assist decision-making
– Quality controlled – accurate, with full integrity
12. Extraction
The Extract step covers data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible.
Logically, data can be extracted in two ways before physical data extraction.
13. Logical Extraction:
Full Extraction: Full extraction is used when the data needs to be extracted and loaded for the first time. In full extraction, the data from the source is extracted completely; the extraction reflects the data currently available in the source system.
Incremental Extraction: In incremental extraction, changes in the source data are tracked since the last successful extraction, and only these changes are extracted and then loaded. The changes can be detected from source data that carry a last-changed timestamp, or a change table can be created in the source system to keep track of changes in the source data.
One more way to get the incremental changes is to extract the complete source data and compute a difference (minus operation) between the current extraction and the last extraction; this approach causes performance issues. A sketch of full versus timestamp-based incremental extraction follows.
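As a minimal sketch of the two logical extraction modes, the snippet below pulls either all rows or only rows changed since the last successful extraction. The orders table, the last_modified column, and the in-memory SQLite source are illustrative assumptions, not part of the original slides.

```python
import sqlite3

def extract(conn, last_extract_time=None):
    """Full extraction when last_extract_time is None; otherwise incremental:
    only rows changed since the last successful extraction are retrieved."""
    if last_extract_time is None:
        # Full extraction: pull every row currently in the source table.
        return conn.execute("SELECT id, amount, last_modified FROM orders").fetchall()
    # Incremental extraction: rely on a last-changed timestamp column.
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
        (last_extract_time,),
    ).fetchall()

# A tiny in-memory table stands in for the real OLTP source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2024-01-01"), (2, 25.5, "2024-02-15")])

print(extract(conn))                                  # full extraction: both rows
print(extract(conn, last_extract_time="2024-02-01")) # incremental: only row 2
```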
14. Physical Extraction:
The data can be extracted physically by two methods:
Online Extraction: In online extraction the data is extracted directly from the source system. The extraction process connects to the source system and extracts the source data. Because the data is taken directly from the source for processing in the staging area, this is called online extraction. During extraction we connect directly to the source system and access the source tables; there is no need for any external staging copy.
Offline Extraction: The data from the source system is dumped outside of the source system into a flat file, and this flat file is used to extract the data. The flat file can be created by a routine process, for example daily. Here the data is not extracted directly from the source; instead it is taken from an external area that keeps a copy of the source, such as flat files or dump files in a specific format. When we need to process the data, we can fetch the records from this external copy instead of the actual source. Both methods are sketched below.
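A minimal sketch of both physical extraction methods, assuming a customers source table, an in-memory SQLite database standing in for the source system, and a customers_dump.csv flat file produced by a routine dump job; all of these names are illustrative.

```python
import csv
import sqlite3

def online_extract(conn):
    # Online: connect directly to the source system and read the source table.
    return conn.execute("SELECT id, name FROM customers").fetchall()

def offline_extract(dump_path):
    # Offline: a routine job has already dumped the source table to a flat
    # file; we read that copy instead of touching the source system itself.
    with open(dump_path, newline="") as f:
        return [(int(row["id"]), row["name"]) for row in csv.DictReader(f)]

# Online usage: a small in-memory database stands in for the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
print(online_extract(conn))

# Offline usage: read from the flat-file dump created outside the source.
with open("customers_dump.csv", "w", newline="") as f:
    csv.writer(f).writerows([("id", "name"), (1, "Alice")])
print(offline_extract("customers_dump.csv"))
```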
15. What data should be extracted?
• The selection and analysis of the source system is usually broken into two major phases:
– The data discovery phase
– The anomaly detection phase
16. Extraction - Data Discovery Phase
• A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it.
• Once you understand what the target needs to look like, you need to identify and examine the data sources.
17. Data Discovery Phase
• It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse:
– Collecting and documenting source systems
– Keeping track of source systems
– Determining the system of record, i.e. the point where the data originates
– Defining the system of record is important because in most enterprises data is stored redundantly across many different systems; enterprises do this to make nonintegrated systems share data. It is very common for the same piece of data to be copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data.
18. Data Content Analysis - Extraction
• Understanding the content of the data is crucial for determining the best approach for retrieval.
– NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns: joining two or more tables on a column that contains NULL values will cause data loss! Remember, in a relational database NULL is not equal to NULL; that is why those joins fail. Check for NULL values in every foreign key in the source database, and when NULL values are present, you must outer join the tables (see the sketch below).
– Dates in non-date fields. Dates are very peculiar elements because they are the only logical elements that can come in various formats, literally containing different values while having exactly the same meaning. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format.
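To make the NULL foreign key risk concrete, the following sketch (using an illustrative orders/customers pair in an in-memory SQLite database) shows how an inner join silently drops the row whose foreign key is NULL, while an outer join keeps it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (100, 1), (101, NULL);  -- order 101 has a NULL foreign key
""")

# Inner join: order 101 disappears because its NULL key matches nothing.
inner = conn.execute("""
    SELECT o.order_id, c.name FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()

# Outer join: order 101 is kept, with NULL in place of the customer name.
outer = conn.execute("""
    SELECT o.order_id, c.name FROM orders o
    LEFT OUTER JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()

print(inner)  # [(100, 'Alice')]
print(outer)  # [(100, 'Alice'), (101, None)]
```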
20. Data Transformation
• Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse.
• Data transformation consists of a variety of different functions:
– record-level functions,
– field-level functions, and
– more complex transformations.
21. Record-Level and Field-Level Functions
• Record-level functions
– Selection: data partitioning
– Joining: data combining
– Normalization
– Aggregation: data summarization, aggregates
• Field-level functions
– Single-field transformation: from one field to one field
– Multi-field transformation: from many fields to one, or from one field to many
A short sketch of several of these functions follows.
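The sketch below illustrates a few of these functions with pandas; the sales and stores tables, the column names, and the currency factor are illustrative assumptions.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "amount": [100.0, 50.0, 80.0],
    "first_name": ["Ann", "Bo", "Cy"],
    "last_name": ["Lee", "Kim", "Ray"],
})
stores = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Record-level functions
selected   = sales[sales["amount"] > 60]             # selection: data partitioning
joined     = sales.merge(stores, on="store")         # joining: data combining
aggregated = sales.groupby("store")["amount"].sum()  # aggregation: data summarization

# Field-level functions
sales["amount_eur"] = sales["amount"] * 0.9          # single-field: one field to one field (assumed rate)
sales["full_name"] = sales["first_name"] + " " + sales["last_name"]  # multi-field: many fields to one

print(joined)
print(aggregated)
```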
22. Transformation
• The main step where ETL adds value
• Actually changes data and provides guidance on whether the data can be used for its intended purposes
• Performed in the staging area
24. Need for a Staging Area:
Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is used during the ETL process to store intermediate results of processing.
It makes it possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the extract step.
The staging area should be accessed by the ETL load process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-processing data.
25. Transformation - Data Quality Paradigm
• Correct
• Unambiguous/Clear
• Consistent
• Complete
• Data quality checks are run at two places: after extraction, and after cleaning and confirming, where additional checks are run.
27. Transformation - Cleaning Data
• Anomaly detection
– Data sampling, e.g. a count(*) of the rows for a department column
• Column property enforcement (sketched below)
– NULL values in columns
– Numeric values that fall outside of expected highs and lows
– Columns whose lengths are exceptionally short/long
– Columns with certain values outside of discrete valid value sets
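A minimal sketch of these checks with pandas; the dept and salary columns, the salary bounds, and the valid department set are illustrative assumptions.

```python
import pandas as pd

extracted = pd.DataFrame({
    "dept":   ["Sales", "HR", None, "Salez"],
    "salary": [50_000, 62_000, 1_000_000_000, 48_000],
})
VALID_DEPTS = {"Sales", "HR", "IT"}

# Data sampling: row counts per department value, including NULLs.
print(extracted["dept"].value_counts(dropna=False))

# Column property enforcement checks.
null_rows     = extracted[extracted["dept"].isna()]                       # NULL values in columns
out_of_range  = extracted[~extracted["salary"].between(10_000, 500_000)]  # outside expected high/low
too_short     = extracted[extracted["dept"].str.len().fillna(0) < 2]      # exceptionally short values
invalid_codes = extracted[~extracted["dept"].isin(VALID_DEPTS)]           # outside the valid value set

print(null_rows, out_of_range, invalid_codes, sep="\n")
```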
28. Cleansing
• The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, and Man/Woman/Not Available are translated to a standard Male/Female/Unknown)
• Converting null values into a standardized Not Available/Not Provided value
• Converting phone numbers and ZIP codes to a standardized form
• Validating address fields and converting them into proper naming, e.g. Street/St/St./Str./Str.
• Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street)
A minimal sketch of a few such unification rules follows.
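The snippet below sketches three of these unification rules; the exact mappings and the digits-only phone format are illustrative assumptions.

```python
import re

SEX_MAP = {"m": "Male", "man": "Male", "male": "Male",
           "f": "Female", "woman": "Female", "female": "Female"}

def clean_sex(value):
    # Make identifiers unique: map the many source encodings to one standard set.
    if value is None:
        return "Unknown"
    return SEX_MAP.get(value.strip().lower(), "Unknown")

def clean_phone(value):
    # Convert phone numbers to a standardized digits-only form.
    return re.sub(r"\D", "", value or "")

def clean_street(value):
    # Normalize common street-type abbreviations to one spelling.
    return re.sub(r"\b(St\.?|Str\.?)\s*$", "Street", (value or "").strip())

print(clean_sex("M"), clean_sex(None))   # Male Unknown
print(clean_phone("(555) 123-4567"))     # 5551234567
print(clean_street("12 Main St."))       # 12 Main Street
```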
30. Cleansing Further Leads to the ETL Process
(Flow on the original slide: staged data passes through cleaning and confirming; if errors are found, loading stops; if not, loading proceeds.)
31. Transformation - Confirming
• Structure enforcement
– Tables have proper primary and foreign keys
– Obey referential integrity
• Data and rule value enforcement
– Simple business rules
– Logical data checks
33. Data Loading:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. To make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency. A minimal sketch of this pattern follows.
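A minimal sketch of the load step, assuming a sales_fact target table in SQLite. The exact statements for disabling constraints and indexes are engine-specific; dropping and recreating an index here only illustrates the general idea.

```python
import sqlite3

def load(conn, rows):
    # Disable (drop) the index before the load and rebuild it afterwards,
    # so the inserts do not pay the index-maintenance cost row by row.
    conn.execute("DROP INDEX IF EXISTS idx_sales_date")
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_sales_date ON sales_fact(sale_date)")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, product_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_date ON sales_fact(sale_date)")
load(conn, [("2024-01-01", 1, 9.99), ("2024-01-02", 2, 4.50)])
print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone())
```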
34. Loading Can Be…
Full Load: the entire data dump, loaded the very first time. Here we leave the last extract date empty so that all the data gets loaded.
Incremental Load: the delta, i.e. the difference between target and source data, is loaded at regular intervals. Here we supply the last extract date so that only records after this date are loaded. Both variants are sketched below.
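A minimal sketch of the two load modes against an illustrative dim_customer target table. The incremental variant uses SQLite's UPSERT syntax (ON CONFLICT ... DO UPDATE), which requires SQLite 3.24 or newer; a MERGE statement would play the same role in other engines.

```python
import sqlite3

def full_load(conn, rows):
    # Full load: truncate the target and reload everything from scratch.
    conn.execute("DELETE FROM dim_customer")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

def incremental_load(conn, delta_rows):
    # Incremental load: only new or changed rows (extracted after the last
    # extract date) are applied; existing keys are updated, new keys inserted.
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?) "
        "ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name",
        delta_rows,
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
full_load(conn, [(1, "Alice"), (2, "Bob")])
incremental_load(conn, [(2, "Bobby"), (3, "Cara")])  # updates 2, inserts 3
print(conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall())
```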
35. Why Incremental?
Speed. Opting to do a full load on larger datasets takes a great amount of time and other server resources. Ideally all the data loads are performed overnight, with the expectation of completing them before users see the data the next day. The overnight window may not be enough time for a full load to complete.
Preserving history. When dealing with an OLTP source that is not designed to keep history, a full load will remove history from the destination as well, since a full load removes all the records first. So a full load will not allow you to preserve history in the data warehouse.
36. Full Load vs. Incremental Load:
– A full load truncates all rows and loads from scratch; an incremental load brings in only new and updated records.
– A full load requires more time; an incremental load requires less time.
– A full load can easily be guaranteed; an incremental load is more difficult, since the ETL must check for new/updated rows.
– With a full load, history can be lost; with an incremental load, it is retained.
38. Dimensions
• Qualifying characteristics that provide additional perspectives to a given fact
– DSS data is almost always viewed in relation to other data
• Dimensions are normally stored in dimension tables
39. Facts
• Numeric measurements (values) that represent a specific business aspect or activity
• Stored in a fact table at the center of the star schema
• The fact table contains facts that are linked through their dimensions
• Facts can be computed or derived at run time
• Facts are updated periodically with data from operational databases
A small star schema sketch follows.
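To tie the two together, here is a minimal star schema sketch in SQLite; the dim_date, dim_product, and fact_sales tables and their columns are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: qualifying characteristics that give each fact context.
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table at the center of the star: numeric measurements linked to
    -- their dimensions through foreign keys.
    CREATE TABLE fact_sales (
        date_id    INTEGER REFERENCES dim_date(date_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        units_sold INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_date VALUES (1, '01', 'Jan', 2024);
    INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, 10, 5, 49.95);
""")

# Viewing facts in relation to their dimensions: revenue by year and category.
print(conn.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, p.category
""").fetchall())
```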