This document outlines the ETL (Extract, Transform, Load) process essential for data warehousing, detailing the stages of extraction, transformation, and loading of data, as well as data quality considerations. It describes various ETL tools, best practices, and challenges faced in ensuring data quality throughout the process. The document emphasizes the roles of different stakeholders in data quality initiatives and the adoption of data quality tools for error discovery and correction.
An overview of ETL process and its significance in data warehousing, focusing on extraction, transformation, loading, and data quality.
Definition and importance of ETL in data warehousing; integration of multiple data sources for a consistent data store.
List of prominent ETL tools like Google BigQuery, Informatica, and Oracle Data Integrator.
Discussion on the advantages of ETL tools including scalability, simplicity, compliance, and cost-effectiveness.
Details on the extraction phase: identifying source systems, extraction methods, issues, and best practices.
Explanation of data transformation tasks including filtering, cleaning, merging, and summarizing to standardize data.
Description of the loading phase, discussing methods to load transformed data into the data warehouse.
Exploration of data quality concepts, definitions, and importance for operational systems and business applications.
Characteristics of data quality including accuracy, completeness, redundancy, and timeliness.
Overview of challenges faced in maintaining data quality and initiating data quality frameworks.
Description of different roles involved in ensuring data quality, including data consumers, producers, and experts.
Information on tools for data cleansing, error discovery, and features to ensure data quality.
Advantages of high data quality including better analytics, customer service, and strategic decision-making.
Importance of design reviews for quality assurance in data warehouse design and the different design views.
Three approaches for constructing data warehouses: top-down, bottom-up, and combined methods.
Steps involved in designing a data warehouse including business processes, grain selection, and dimension choice.
Importance of testing for data integrity and the different levels of testing in data warehouse systems.
A structured approach to testing data warehouse functionality, covering entry points and test framework design.
Aspects of testing in operational environments, such as security and management tools within a data warehouse.
Overview of the need for monitoring data warehouses to ensure performance, usability, and compliance.
UNIT 2
ETL Process and Maintenance of the Data Warehouse
Data Extraction, Data Transformation, Data Loading, Data Quality, Data Warehouse Design Reviews, Testing and Monitoring the Data Warehouse.
ETL Process
• ETL is a process in data warehousing; it stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and finally loads it into the data warehouse system.
• It is a data integration process that combines data from multiple data sources into a single, consistent data store in a data warehouse.
• The ETL tool may be customized to suit the needs of the enterprise.
(E.g.) ETL tool sets for long-term analysis and usage of data in banking, insurance claims, retail sales history, etc.
ETL Tools
• Google BigQuery.
• Amazon Redshift.
• Informatica – PowerCenter.
• IBM – InfoSphere Information Server.
• Oracle Data Integrator.
• SQL Server Integration Services.
Benefits of ETL Tools
1) Scalability – virtually unlimited scalability is available at the click of a button, i.e. capacity can be changed in size on demand.
2) Simplicity – it saves time and resources and avoids a lot of complexity.
3) Out of the box – open-source ETL requires customization and cloud-based ETL requires integration.
4) Compliance – it provides an easy way to avoid complicated and risky compliance setups.
5) Long-term costs – open-source ETL tools are cheaper up front but can cost more in the long run.
Phase (Steps) of ETL Process
Extraction (E):
The first step is the extraction of data: the source system's data is accessed first and prepared for further processing and extraction of the required values.
Data is extracted from various formats and sources such as relational databases, NoSQL stores, XML and flat files, etc.
It is important to store the extracted data in a staging area, not directly in the data warehouse, as bad data may cause damage there and a rollback would be much more difficult.
Extraction has three approaches –
a) Update Notification – when data is changed or altered in the source system, the system notifies about the change.
b) Incremental Extract – many systems are incapable of providing notification but are efficient enough to track the changes made to the source data.
c) Full Extract – the whole data set is extracted when the system is neither able to notify nor able to track changes. An old copy of the data is maintained to identify the changes.
(E.g.) conversion of phone numbers and email addresses to a standard form, validation of address fields, etc.
The data extraction issues are –
Source Identification – identify source applications and source structures.
Method of extraction – for each data source, define whether the extraction process is manual or tool-based.
Extraction frequency – for each data source, establish how frequently the data extraction must be done (daily, weekly, quarterly, etc.).
Time Window – for each data source, denote the time window for the extraction process.
Job sequencing – determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
Exception handling – determine how to handle input records that can't be extracted.
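As an illustration only (not from the original slides), the sketch below shows one way this extraction metadata – source, method, frequency, time window, job sequencing and exception handling – might be recorded per data source. The class, field and source names are hypothetical.

```python
# Illustrative sketch: recording extraction metadata for each data source.
from dataclasses import dataclass, field

@dataclass
class ExtractJobSpec:
    source: str                 # source identification
    method: str                 # "manual" or "tool-based"
    frequency: str              # daily, weekly, quarterly, ...
    time_window: str            # when the extract may run
    depends_on: list = field(default_factory=list)  # job sequencing
    on_bad_record: str = "reject-and-log"           # exception handling

jobs = [
    ExtractJobSpec("orders_db", "tool-based", "daily", "01:00-03:00"),
    ExtractJobSpec("crm_flat_files", "manual", "weekly", "Sat 02:00-06:00",
                   depends_on=["orders_db"]),
]

for job in jobs:
    print(f"{job.source}: {job.frequency} extract during {job.time_window}")
```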
The following are the guidelines adopted for extracts as best practices –
The extract processing should identify changes since the last extract.
Interface record types should be defined for the extraction based on entities in the data warehouse model.
(E.g.) Client information extracted from a source may be categorized into person attributes, contact point information, etc.
When changes are sent to the data warehouse, all current attributes for the changed entity should also be sent.
Any interface record should be categorized as –
Records which have been added to the operational database since the last extract.
Records which have been deleted from the operational database since the last extract.
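A minimal sketch of this categorization, assuming the previous and current extracts are available as dictionaries keyed by the record's primary key (keys and attributes below are illustrative):

```python
# Sketch: categorizing interface records as adds/deletes/changes since the last extract
# by comparing the current snapshot with the previous one.
previous = {"C001": {"name": "Asha", "city": "Chennai"},
            "C002": {"name": "Ravi", "city": "Madurai"}}
current  = {"C002": {"name": "Ravi", "city": "Coimbatore"},
            "C003": {"name": "Meena", "city": "Salem"}}

added   = [k for k in current if k not in previous]
deleted = [k for k in previous if k not in current]
changed = [k for k in current if k in previous and current[k] != previous[k]]

print("added:", added)      # ['C003']
print("deleted:", deleted)  # ['C001']
print("changed:", changed)  # ['C002'] – all current attributes would be sent
```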
Transformation (T):
The second step of the ETL process is transformation.
A set of rules or functions is applied to the extracted data to convert it into a single standard format.
It includes dimension conversion, aggregation, joining, derivation and calculation of new values.
Transformation involves the following processes or tasks –
a) Filtering – loading only certain attributes into the data warehouse.
b) Cleaning – filling up NULL values with some default values, mapping U.S.A, United States, and America to USA, etc.
c) Joining – joining multiple attributes into one.
d) Splitting – splitting a single attribute into multiple attributes.
e) Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
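To make these tasks concrete, here is a small, illustrative Python sketch that filters attributes, cleans NULLs and inconsistent country values, splits a name field and sorts on the key attribute (joining attributes would simply be the reverse of the split). The field names and sample records are hypothetical.

```python
# Sketch of the transformation tasks above on a small record set (illustrative data).
rows = [
    {"cust_id": 2, "full_name": "Ravi Kumar", "country": "United States", "phone": None},
    {"cust_id": 1, "full_name": "Asha Devi",  "country": "U.S.A",         "phone": "044-123"},
]

country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

transformed = []
for r in rows:
    first, last = r["full_name"].split(" ", 1)         # splitting one attribute into two
    transformed.append({
        "cust_id": r["cust_id"],                        # filtering: keep only needed attributes
        "first_name": first,
        "last_name": last,
        "country": country_map.get(r["country"], r["country"]),  # cleaning: standardize values
        "phone": r["phone"] or "UNKNOWN",               # cleaning: default for NULLs
    })

transformed.sort(key=lambda r: r["cust_id"])            # sorting on the key attribute
print(transformed)
```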
Major Data Transformation Types:
a) Format Revisions – these include changes to the data types and lengths of individual fields.
(E.g.) Product package type may be indicated by both codes and names, with the fields being numeric and text data types.
b) Decoding of Fields – this is common when dealing with multiple source systems in which the same data items are described by different field values.
(E.g.) The coding for Male and Female may be 1 and 2 in one source system and M and F in another source system.
c) Splitting of Single Fields – earlier systems may store name, address, city and state data together in a single field, but the individual components need to be stored in separate fields in the data warehouse.
d) Merging of Information – this does not simply mean merging several fields to create a single field of data.
(E.g.) Information about a product may come from different data sources – product code and description from one source, package type from another. Merging of information denotes the combination of product code, description and package type into a single entity.
e) Character Set Conversion – this is the conversion of character sets to an agreed standard character set for text data in the data warehouse.
f) Conversion of Units of Measurement – metrics based on overseas operations must be converted so that the numbers are all in one standard unit of measurement.
g) Date / Time Conversion – this is the representation of date and time in standard formats.
h) Summarization – this type of transformation creates summaries to be loaded into the data warehouse instead of loading the most granular level of data.
(E.g.) It is not necessary to store every single credit card transaction in the data warehouse; instead, summarize the daily transactions for each credit card and store the summary data.
i) Key Restructuring – while extracting data from input sources, look at the primary keys of the extracted records and derive keys for the fact and dimension tables based on the keys in the extracted records.
j) Deduplication – in most companies, customer files have several records for the same customer, often created by mistake. The goal is to maintain one record per customer and link all duplicates in the source systems to that single record.
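As a hedged illustration of two of these transformation types – decoding of fields and summarization – the following sketch uses hypothetical code mappings and transactions:

```python
# Sketch: decoding of fields and summarization (sample data is illustrative).
from collections import defaultdict

gender_decode = {"1": "M", "2": "F", "M": "M", "F": "F"}   # decoding of fields

transactions = [
    {"card": "4111", "date": "2024-01-05", "amount": 120.0},
    {"card": "4111", "date": "2024-01-05", "amount": 35.5},
    {"card": "4222", "date": "2024-01-05", "amount": 80.0},
]

# Summarization: daily totals per credit card instead of individual transactions.
daily_totals = defaultdict(float)
for t in transactions:
    daily_totals[(t["card"], t["date"])] += t["amount"]

print(gender_decode["2"])   # 'F'
print(dict(daily_totals))   # {('4111', '2024-01-05'): 155.5, ('4222', '2024-01-05'): 80.0}
```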
Loading (L):
The third and final step of the ETL process is loading.
The transformed data is finally loaded into the data warehouse.
Data is loaded into the data warehouse frequently, but at regular intervals.
Indexes and constraints previously applied to the data need to be disabled before loading commences.
The rate and period of loading depend on requirements and vary from system to system.
During the loads, the data warehouse has to be offline.
A time period should be identified when loads may be scheduled without affecting data warehouse users.
Consider dividing the whole load process into smaller chunks and populating a few files at a time.
Modes of Loading (L):
a) Load – if the targeted table already exists and data exists in the table, the load process wipes out the existing data and applies the data from the incoming file. If the table is empty before loading, the load process simply applies the data from the incoming file.
b) Append – this is an extension of the load. If data already exists in the table, the append process unconditionally adds the incoming data, preserving the existing data in the target table. Incoming duplicate records may be rejected during the append process.
c) Destructive Merge – in this mode, apply the incoming data to the target data. If an incoming record matches the key of an existing record, update the matching target record; if not, add the incoming record to the target table.
d) Constructive Merge – this mode is the opposite of the destructive merge, i.e. if an incoming record matches the key of an existing record, leave the existing record, add the incoming record and mark the added record as superseding the old record.
e) Initial Load – loads the whole data warehouse in a single run. It is possible to split the load into separate sub-loads and run each as a single load. If more than one run is needed to create a single table, the runs may be scheduled over several days.
f) Incremental Loads – these are the applications of ongoing changes from the source systems. They need a method to preserve the periodic nature of the changes in the data warehouse. Constructive merge is an appropriate mode for incremental loads.
g) Full Refresh – this application of data involves periodically rewriting the entire data warehouse. Sometimes partial refreshes rewrite only specific tables, but partial refreshes are rare because the dimension tables are intricately tied to the fact table.
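The difference between destructive and constructive merge can be sketched as follows; the target table is held as a plain list of records keyed by a "key" field, and all data values are illustrative rather than taken from the source:

```python
# Sketch contrasting destructive and constructive merge (illustrative data).
def destructive_merge(target_rows, incoming_rows):
    """If an incoming key matches an existing record, update it; otherwise add it."""
    by_key = {r["key"]: r for r in target_rows}
    for row in incoming_rows:
        by_key[row["key"]] = row
    return list(by_key.values())

def constructive_merge(target_rows, incoming_rows):
    """Leave existing records untouched; add incoming records, marking them as
    superseding any older record with the same key."""
    existing_keys = {r["key"] for r in target_rows}
    merged = list(target_rows)
    for row in incoming_rows:
        merged.append(dict(row, supersedes_older=row["key"] in existing_keys))
    return merged

target = [{"key": 1, "status": "active"}]
incoming = [{"key": 1, "status": "closed"}, {"key": 2, "status": "active"}]
print(destructive_merge(target, incoming))
print(constructive_merge(target, incoming))
```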
Data Quality (DQ) in Data Warehouse
What is Data Quality?
For an IT professional, data quality is quite often equated with data accuracy, and accuracy is associated with a data element.
(E.g.) Consider Customer as an entity with attributes such as Customer Name, Customer Address, Customer State, Customer Mobile No, etc.
Data accuracy means that the values of these attributes of the customer entity correctly describe the particular customer.
(i.e.) if data is fit for the purpose for which it is intended, then such data has quality.
Data quality is related to the usage of the data item as defined by the users.
Data quality in operational systems requires that database records conform to field validation edits; this is part of data quality, but single-field edits alone do not constitute data quality.
Definition:
Data quality refers to the overall utility of a dataset as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system.
Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole.
(E.g.) In an online ordering system, while entering data about customers in the order entry application, we may collect the demographics of each customer. Sometimes these demographic factors are not needed for order entry and so receive little attention.
Later, when users try to access those items, the integrated whole lacks data quality.
(E.g.) Some items of customer information may or may not be important when filling in an application form (especially in banking processes).
Data Accuracy Vs Data Quality
Difference between Data Accuracy and Data Quality:
• Data Accuracy: a specific instance of an entity accurately represents that occurrence of the entity. Data Quality: the data item is exactly fit for the purpose for which the business users have defined it.
• Data Accuracy: the data element is defined in terms of database technology. Data Quality: a wider concept grounded in the specific business of the company.
• Data Accuracy: the data element conforms to validation constraints. Data Quality: relates not just to single data elements but to the system as a whole.
• Data Accuracy: individual data items have the correct data types. Data Quality: the form and content of data elements are consistent across the whole system.
• Data Accuracy: traditionally relates to operational systems. Data Quality: essentially needed in a corporate-wide data warehouse for business users.
Characteristics (Dimensions) of Data Quality
The data quality dimensions are –
1. Accuracy:
The value stored in the system for a data element is the right value for that occurrence of the data element. (E.g.) getting the correct customer address.
2. Domain Integrity:
The data value of an attribute falls in the range of allowable, defined values. (E.g.) Male and Female for the gender data element.
3. Data Type:
The value for a data attribute is actually stored as the data type defined for the attribute. (E.g.) the Name field is defined as 'text'.
4. Consistency:
The form and content of a data field are the same across multiple source systems. (E.g.) the product code for Product A is 1234 in every source system.
5. Redundancy:
The same data must not be stored in more than one place in a system.
6. Completeness:
There are no missing values for a given attribute in the system.
7. Duplication:
Duplication of records in a system is completely resolved. (E.g.) duplicate records are identified and a cross-reference is created.
8. Conformance to Business Rules:
The values of each data item adhere to prescribed business rules. (E.g.) in an auction system, the sale price can't be less than the reserve price.
9. Structural Definiteness:
Wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure. (E.g.) names are divided into first name, middle name and last name, which reduces missing values.
10. Data Anomaly:
A field must be used only for the purpose for which it is defined. (E.g.) the third line of an address column should contain only the third line of the address, not phone or fax numbers.
11. Clarity:
A data element may possess all the other characteristics of quality data, but if the users do not understand its meaning clearly, the data element is of no value to them.
12. Timely:
The users determine the timeliness of the data. (E.g.) updating the customer database on a daily basis.
13. Usefulness:
Every data element in the data warehouse must satisfy some requirement of the collection of users.
14. Adherence to Data Integrity Rules:
The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules.
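A simple, illustrative sketch of how a few of these dimensions – completeness, domain integrity, data type and duplication – might be checked over a batch of records (rules and sample data are hypothetical):

```python
# Sketch: basic data quality dimension checks over a record set (illustrative data).
records = [
    {"cust_id": "C001", "name": "Asha", "gender": "F", "state": "TN"},
    {"cust_id": "C002", "name": None,   "gender": "X", "state": "KA"},
    {"cust_id": "C001", "name": "Asha", "gender": "F", "state": "TN"},
]

issues = []
seen_keys = set()
for i, r in enumerate(records):
    if r["name"] in (None, ""):            # completeness: no missing values
        issues.append((i, "missing name"))
    if r["gender"] not in {"M", "F"}:      # domain integrity: allowable values only
        issues.append((i, "gender outside domain"))
    if not isinstance(r["cust_id"], str):  # data type check
        issues.append((i, "cust_id not text"))
    if r["cust_id"] in seen_keys:          # duplication
        issues.append((i, "duplicate cust_id"))
    seen_keys.add(r["cust_id"])

print(issues)
```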
Data Quality Challenges (Problems) in DW
[Bar chart: "Data Warehouse Challenges" – percentage (0%–50%) of challenges attributed to data quality, data modeling, user expectations, data transformation, business rules, management expectations and database performance.]
Data Quality Framework
The framework, carried out jointly by IT professionals and user representatives, involves the following steps:
• Establish a Data Quality Steering Committee.
• Agree on a suitable data quality framework.
• Identify the business functions affected most by bad data.
• Institute data quality policy and standards.
• Select high-impact data elements and determine priorities.
• Define quality measurement parameters and benchmarks.
• Plan and execute data cleansing for high-impact data elements (initial data cleansing efforts).
• Plan and execute data cleansing for other, less severe elements (ongoing data cleansing efforts).
Data Quality – Participants and Roles
The participants in data quality initiatives are:
• Data Consumer (User Dept.)
• Data Producer (User Dept.)
• Data Expert (User Dept.)
• Data Integrity Specialist (User Dept.)
• Data Correction Authority (IT Dept.)
• Data Consistency Expert (IT Dept.)
• Data Policy Administrator (IT Dept.)
The responsibilities for the roles are –
1. Data Consumers:
Uses the data warehouse for queries, reports and analysis. Establishes the
acceptable levels of data quality.
2. Data Producer:
Responsible for the quality of data input into the source systems.
3. Data Expert:
Expert in the subject matter and the data itself of the source systems.
Responsible for identifying pollution in the source systems.
4. Data Policy Administrator:
Ultimately responsible for resolving data corruption as data is
transformed and moved into the data warehouse.
5. Data Integrity Specialist:
Responsible for ensuring that the data in the source systems conforms to
the business rules.
6. Data Correction Authority:
Responsible for actually applying the data cleansing techniques through
the use of tools or in-house programs.
7. Data Consistency Expert:
Responsible for ensuring that all data within the data warehouse (various
data marts) are fully synchronized.
Data Quality Tools
The useful data quality tools are –
1. Categories of Data Cleansing Tools:
They assist in two ways –
Data error discovery tools work on the source data to identify inaccuracies and inconsistencies.
Data correction tools help fix the corrupt data; they use a series of algorithms to parse, transform, match, consolidate and correct the data.
2. Error Discovery Features:
The following is a list of error discovery functions that data cleansing tools are capable of performing –
Quickly and easily identify duplicate records.
Identify data items whose values are outside the range of legal domain values.
Find inconsistent data.
Check for range of allowable values.
Detect inconsistencies among data items from different sources.
Allow users to identify and quantify data quality problems.
Monitor trends in data quality over time.
Report to users on the quality of data used for analysis.
Reconcile problems of RDBMS referential integrity.
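One of these discovery functions – detecting inconsistencies among data items from different sources – could be sketched as follows (source names and product codes are illustrative):

```python
# Sketch: discover inconsistent package-type descriptions for the same product code
# across two source systems (illustrative data).
source_a = {"P100": "Carton", "P200": "Bottle"}
source_b = {"P100": "Carton", "P200": "Can"}

inconsistent = {code: (source_a[code], source_b[code])
                for code in source_a.keys() & source_b.keys()
                if source_a[code] != source_b[code]}

print(inconsistent)   # {'P200': ('Bottle', 'Can')} – flagged for investigation
```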
3. Data Correction features:
The following list describes the typical error correction functions that data cleansing
tools are capable of performing –
Normalize inconsistent data.
Improve merging of data from dissimilar data sources.
Group and relate customer records belonging to the same household.
Provide measurements of data quality.
Standardize data elements to common formats.
Validate for allowable values.
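As an illustration of correction features such as standardizing data elements to common formats, the sketch below normalizes state codes and phone numbers; the normalization rules are hypothetical, not prescribed by the source:

```python
# Sketch: standardizing data elements to common formats (illustrative rules).
import re

state_codes = {"tamil nadu": "TN", "tn": "TN", "karnataka": "KA", "ka": "KA"}

def standardize_state(value):
    # Map known variants to a standard code; otherwise just trim and upper-case.
    return state_codes.get(value.strip().lower(), value.strip().upper())

def standardize_phone(value):
    digits = re.sub(r"\D", "", value)   # keep digits only
    return digits[-10:] if len(digits) >= 10 else digits

print(standardize_state(" tamil nadu "))          # 'TN'
print(standardize_phone("+91 (044) 2345-6789"))   # '4423456789'
```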
4. DBMS for Quality Control:
The database management system can be used as a tool for data quality control in many ways; in particular, an RDBMS can prevent several types of errors from creeping into the data warehouse –
Domain integrity – provide domain value edits. Prevent entry of data if entered
data value is outside the defined limits of value.
Update security – prevent unauthorized updates to the databases, which stops
unauthorized users from updating data in an incorrect way.
Entity Integrity Checking – ensure that duplicate records with same primary key
value are not entered.
Minimize missing values – ensure that nulls are not allowed in mandatory fields.
Referential Integrity Checking – ensure that relationships based on foreign keys are preserved.
Conformance to Business rules – use trigger programs and stored procedures to
enforce business rules.
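These DBMS controls can be demonstrated with the standard-library sqlite3 module; the table and column names below are illustrative, and trigger-based business rules are omitted for brevity:

```python
# Sketch: RDBMS constraints as data quality controls, using sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # enable referential integrity checking
con.execute("""
    CREATE TABLE customer (
        cust_id TEXT PRIMARY KEY,                  -- entity integrity: no duplicate keys
        name    TEXT NOT NULL,                     -- minimize missing values
        gender  TEXT CHECK (gender IN ('M','F'))   -- domain integrity
    )""")
con.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        cust_id  TEXT NOT NULL REFERENCES customer(cust_id)  -- referential integrity
    )""")

con.execute("INSERT INTO customer VALUES ('C001', 'Asha', 'F')")
try:
    con.execute("INSERT INTO orders VALUES (1, 'C999')")   # unknown customer: rejected
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```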
Benefits of Data Quality
Some specific areas where data quality provides definite benefits are –
Analysis with timely information.
Better Customer Service.
Newer opportunities.
Reduced costs and Risks.
Improved Productivity.
Reliable Strategic Decision Making.
Data Warehouse Design Reviews
One of the most effective techniques for ensuring quality in the
operational environment is the design review.
Errors can be detected and resolved prior to coding through a design review.
The cost benefit of identifying errors early in the development life cycle is enormous.
Design review is usually done on completion of the physical design of an
application.
Some of the issues around operational design review are as follows –
Transaction performance
System availability
Project readiness
Batch window adequacy
Capacity
User requirements satisfaction
Views of Data Warehouse Design
The four views regarding a data warehouse design must be considered –
1. Top-Down View:
This allows the selection of relevant information necessary for data
warehouse.
This information matches current and future business needs.
2. Data Source View:
It exposes the information being captured, stored and managed by
operational systems.
This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables.
Data sources are often modeled by traditional data modeling techniques,
such as entity – relationship model.
3. Data Warehouse View:
This view includes fact tables and dimension tables.
It represents the information that is stored inside the data
warehouse, including pre-calculated totals and counts.
Information regarding the source, date and time of origin is added to provide historical context.
4. Business Query View:
This view is the data perspective in the data warehouse from the
end-user’s view point.
Data Warehouse Design Approaches
A data warehouse can be built using three approaches -
a) The top-down approach:
It starts with the overall design and planning.
It is useful in cases where the technology is mature and well
known, and where the business problems that must be solved
are clear and well understood.
The process begins with an ETL process working from external
data sources.
In the top-down model, integration between the data warehouse and the data marts is automatic as long as the data marts are maintained as subsets of the data warehouse.
b) The bottom-up approach:
The bottom-up approach starts with experiments and prototypes.
This is useful in the early stage of business modeling and
technology development.
It allows an organization to move forward at considerably less
expense and to evaluate the benefits of the technology before
making significant commitments.
This approach is to construct the data warehouse incrementally
over time from independently developed data marts.
In this approach, data flows from sources into data marts, then into
the data warehouse.
c) The Combined approach:
In this approach, both the top-down approach and bottom-up
approaches are exploited.
In combined approach, an organization can exploit the planned
and strategic nature of top-down approach while retaining the
rapid implementation and opportunistic application of the
bottom-up approach.
Data Warehouse Design Process
The general data warehouse design process involves the following steps -
Step 1: Choosing the appropriate Business process:
Based on needs and requirements, there are two types of models: the data warehouse model and the data mart model.
A data warehouse model is chosen if the business process is organisational and involves many complex object collections.
A data mart model is chosen if the business process is departmental and focuses on the analysis of one particular process.
Step 2: Choosing the grain of the business process:
The grain is the fundamental level of data represented in the fact table for the chosen business process.
(E.g.) individual snapshots, individual transactions, etc.
Step 3: Choosing the Dimensions:
It includes selecting the various dimensions, such as time, item and status, that need to be applied to each fact table record.
Step 4: Choosing the measures:
It includes selecting the various measures, such as items_sold and euros_sold, which fill up each fact table record.
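To tie the four steps together, here is a small illustrative sketch that records the design decisions for a hypothetical retail-sales process (the grain, dimensions and measures shown are examples, not requirements):

```python
# Sketch: the four design steps captured for a hypothetical retail-sales process.
sales_design = {
    "business_process": "retail sales",                    # step 1: choose the process
    "grain": "one row per product per store per day",      # step 2: choose the grain
    "dimensions": ["time", "item", "store", "promotion"],  # step 3: choose the dimensions
    "measures": ["items_sold", "euros_sold"],               # step 4: choose the measures
}

for step, value in sales_design.items():
    print(f"{step}: {value}")
```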
Testing & Monitoring the Data Warehouse
Definition:
Data Warehouse testing is the process of building and
executing comprehensive test cases to ensure that data in a
warehouse has integrity and is reliable, accurate and consistent
with the organization’s data framework.
Testing is very important for data warehouse systems for data
validation and to make them work correctly and efficiently.
Data Warehouse Testing is a series of Verification and
Validation activities performed to check for the quality and
accuracy of the Data Warehouse and its contents.
There are five basic levels of testing performed on a data warehouse –
1. Unit Testing:
This type of testing is performed at the developer's end.
In unit testing, each unit / component of modules is separately tested.
Each module of the whole data warehouse, i.e. program, SQL script, procedure, or Unix shell script, is validated and tested.
2. Integration Testing:
In this type of testing, the various individual units / modules of the application are brought together or combined and then tested against a number of inputs.
It is performed to detect faults in the integrated modules and to test whether the various components are performing well after integration.
3. System Testing:
System testing is the form of testing that validates and tests the whole data
warehouse application.
This type of testing is performed by the technical testing team.
This test is conducted after developer’s team performs unit testing and the
main purpose of this testing is to check whether the entire system is working
altogether or not.
4. Acceptance Testing:
To verify that the entire solution meets the business requirements and
successfully supports the business processes from a user’s perspective.
5. System Assurance Testing:
To ensure and verify the operational readiness of the system in a production
environment.
This is also referred to as the warranty period coverage.
Challenges of data warehouse testing are –
Data selection from multiple sources and the analysis that follows pose a great challenge.
Owing to the volume and complexity of the data, certain testing strategies are time-consuming.
ETL testing requires strong SQL (including Hive SQL) skills, so it poses challenges for testers who have limited SQL skills.
Redundant data in a data warehouse, and inconsistent and inaccurate reports.
Data Warehouse Testing Process
Testing a data warehouse is a multi-step process that involves activities like identifying business requirements, designing test cases, setting up a test framework, executing the test cases and validating data.
The steps for testing process are –
Step 1: Identify the various entry points:
As loading data into a warehouse involves multiple stages, it’s
essential to find out the various entry points to test data at each of
those stages.
If testing is done only at the destination, it can be confusing when
errors are found as it becomes more difficult to determine the root
cause.
Step 2: Prepare the required collaterals:
Two fundamental collaterals required for the testing process are database
schema representation and a mapping document.
The mapping document is usually a spreadsheet which maps each
column in the source database to the destination database.
A data integration solution can help generate the mapping document,
which is then used as an input to design test cases.
Step 3: Design an elastic, automated and integrated testing framework:
ETL is not a one-time activity. While some data is loaded all at once and
some through batches, new updates may trickle in
through streaming queues.
A testing framework design has to be generic and architecturally flexible to accommodate new and diverse data sources and types, larger volumes, and the ability to work seamlessly with cloud and on-premises environments.
Integrating the test framework with an automated data solution
(that contains features as discussed in the previous section)
increases the efficiency of the testing process.
Step 4: Adopt a comprehensive testing approach:
The testing framework needs to aim for 100% coverage of the
data warehousing process.
It's important to design multiple testing approaches such as unit, integration, functional, and performance testing.
The data itself has to be scrutinized through many checks that include looking for duplicates, matching record counts, completeness, accuracy, loss of data, and correctness of transformation.
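A minimal sketch of a few such data checks – record counts, duplicate keys and correctness of one transformation – written as plain Python assertions over illustrative source and target rows:

```python
# Sketch: data checks for a data warehouse load, written as assertions (illustrative data).
source_rows = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.0"}]
target_rows = [{"id": 1, "amount": 10.5},  {"id": 2, "amount": 4.0}]

# Matching record counts between source and target.
assert len(source_rows) == len(target_rows), "row count mismatch"

# No duplicate keys loaded into the target.
target_ids = [r["id"] for r in target_rows]
assert len(target_ids) == len(set(target_ids)), "duplicate keys in target"

# Correctness of the transformation (string amount converted to a number).
for src, tgt in zip(source_rows, target_rows):
    assert float(src["amount"]) == tgt["amount"], f"bad transform for id {src['id']}"

print("all checks passed")
```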
Testing the Operational Environment
There are a number of aspects that need to be tested, as below –
1. Security:
A separate security document is required for security testing. This document contains a list of disallowed operations, and tests are devised for each.
2. Scheduler:
Scheduling software is required to control the daily operations of a data
warehouse. It needs to be tested during system testing. The scheduling
software requires an interface with the data warehouse, which will need the
scheduler to control overnight processing and the management of
aggregations.
3. Disk Configuration:
Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
4. Management Tools:
It is required to test all the management tools during system testing. Here is the list of tools that need to be tested –
• Event manager
• System manager
• Database manager
• Configuration manager
• Backup recovery manager
Testing the Database
The database is tested in the following three ways –
1. Testing the database manager and monitoring tools:
To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
2. Testing database features:
Here is the list of features that we have to test −
– Querying in parallel
– Create index in parallel
– Data load in parallel
3. Testing database performance:
Query execution plays a very important role in data warehouse performance
measures. There are sets of fixed queries that need to be run regularly and
they should be tested.
Data Warehouse Monitoring
Data warehouse monitoring helps to understand how the data warehouse is performing.
Some of the several reasons for monitoring are –
It ensures top performance.
It ensures excellent usability.
It ensures the business can run efficiently.
It prevents security issues.
It ensures governance and compliance.