Data Warehousing

Introduction
• Data warehousing is a process for assembling and managing
data from various sources for the purpose of gaining a single
detailed view of an enterprise.
• A data warehouse is a subject-oriented, integrated, time-variant
and non-volatile repository of information in support of
management’s decision-making process. (Bill Inmon, 1992)
7/2/2019 Compiled By: Kamal Acharya

Contd..
• The benefits of implementing a data warehouse are as follows:
– To provide a single version of truth about enterprise information.
– To speed up queries involving aggregations across multiple attributes.
– To provide a system in which managers who do not have a strong
technical background are able to run complex queries.
– To provide a database that stores relatively clean data.
– To provide a database that stores historical data.

Contd..
• Difference between OLTP and data warehouse systems:
Property | OLTP | Data warehouse
Nature of database | 2D tables | Multidimensional (cube)
Indexes | Few | Many
Joins | Many | Few
Duplicated data | Normalized data | De-normalized data
Derived data and aggregates | Rare | Common
Queries | Mostly pre-defined and simple | Mostly ad-hoc and complex
Updates | All the time | Only refreshed
Historical data | Often not available | Essential
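
The contrast in the "Queries" row can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and its values are hypothetical and chosen only for illustration; a real warehouse would not normally run on SQLite.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "East", 2018, 100.0),
        (2, "East", 2019, 150.0),
        (3, "West", 2018, 200.0),
        (4, "West", 2019, 50.0),
        (5, "East", 2019, 25.0),
    ],
)


def get_order(order_id):
    """OLTP-style query: a simple, pre-defined lookup of one row by key."""
    return conn.execute(
        "SELECT region, year, amount FROM sales WHERE order_id = ?", (order_id,)
    ).fetchone()


def totals_by_region_and_year():
    """Warehouse-style query: an aggregation across several attributes."""
    return conn.execute(
        "SELECT region, year, SUM(amount) FROM sales "
        "GROUP BY region, year ORDER BY region, year"
    ).fetchall()
```

The first function touches a single row, while the second scans and groups the whole table, which is the kind of workload a data warehouse is organized to serve.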

Operational data stores
• An operational data store is a subject oriented, integrated,
volatile, current valued data store, containing only corporate
detailed data.
• It is designed to provide a consolidated view of the enterprise’s
current operational information.

Contd..
• The ODS is subject-oriented:
– That is, it is organized around the major data subjects of an enterprise
– E.g., in a university, the subjects might be students, lecturers and
courses while in a company subjects might be customers, sales and
products.

Contd..
• The ODS is integrated:
– That is, it is a collection of subject-oriented data from a variety of
systems, providing an enterprise-wide view of the data.

Contd..
• The ODS is current valued data:
– That is, an ODS is up-to-date and reflects the current status of the
information.
– An ODS does not include historical data.

Contd..
• The ODS is volatile:
– That is, the data in the ODS changes frequently as new information
refreshes the ODS.
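
As a rough sketch of what "current valued" and "volatile" mean in practice, the snippet below keeps only the latest record per key in an ODS, while a warehouse-style history keeps every version. The names `ods`, `warehouse_history`, `refresh` and the sample records are assumptions made for illustration.

```python
# Hypothetical in-memory stores, for illustration only.
ods = {}                # current value per customer (volatile)
warehouse_history = []  # every version ever seen (non-volatile)


def refresh(record):
    """Overwrite the ODS entry for this key and append to the history."""
    ods[record["customer_id"]] = record
    warehouse_history.append(record)


refresh({"customer_id": 7, "city": "Pokhara"})
refresh({"customer_id": 7, "city": "Kathmandu"})
```

After the second refresh the ODS holds a single record for customer 7 with the new city, while the history still contains both versions.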

Contd..
• The ODS is detailed:
– That is, the ODS is detailed enough to serve the needs of the
operational management staff in the enterprise.

Contd..
• An ODS may also be used as an interim database for a data
warehouse.
• An ODS may be viewed as an enterprise’s short term memory in
that it stores only very recent information.
• Benefits:
– Improved access to important operational data.
– Better understanding of the business and the customer.
– More efficient generation of current reports without having to access the
OLTP systems.
– Shorter time required to implement a data warehouse.

Contd….
Fig: Typical architecture of operational data stores (the diagram labels three ETL links)

Contd..
• Why separate ODS?
– It is possible to carry out the ODS queries on the existing OLTP
systems, but the OLTP systems have to provide a quick response to
operational users, and the business cannot afford to have response times
suffer when a manager is running a complex query.
– From time to time, complex queries degrade the performance of the OLTP
systems.
– So, to make query processing more efficient without loading the OLTP
systems, a separate ODS is required.

Contd..
• Comparison of the ODS and data warehouse:
ODS | DW
Data of high quality at detailed level and assured availability | Data may not be perfect, but sufficient for strategic analysts; data does not have to be highly available
Contains current and near-current data | Contains historical data
Real-time and near-real-time data loads | Normally batch data loads
Typically detailed data only | Contains summarized and detailed data
Modeled to support rapid data updates | Modeled to optimize query performance
Transactions similar to those in OLTP systems | Complex queries processing larger volumes of data
Used for operational reporting | Used for management reporting

Contd..
• Relationship between OLTP, ODS and DW systems:
Fig: Diagram relating the OLTP, ODS and DW systems

Multi-Tiered Architecture
Fig: Multi-tiered architecture (components recovered from the diagram)
• Data Sources: operational DBs, other sources
• Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
• Data Storage: data warehouse, data marts
• OLAP Server: OLAP engine (serves the data)
• Front-End Tools: analysis, query, reports, data mining

ETL(Extract, Transform, Load)
• ETL, short for extract, transform, and load, names three database
functions.
• ETL is used to migrate data from one database to another, to
form data marts and data warehouses and also to convert
databases from one format or type to another.

Contd..
• To get data out of the source and load it into the data
warehouse:
– Data is extracted from an OLTP database, transformed to match the
data warehouse schema and loaded into the data warehouse database
– Many data warehouses also incorporate data from non-OLTP systems
such as text files, legacy systems, and spreadsheets; such data also
requires extraction, transformation, and loading

Contd..
Fig: ETL process

Data Extraction
• Extract: reads data from a specified source and extracts a desired
subset of data.
• Data is extracted from heterogeneous data sources.
• Each data source has its distinct set of characteristics that need to
be managed and integrated into the ETL system in order to
effectively extract data.
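
Extraction from heterogeneous sources might look like the following minimal sketch, which reads a desired subset of columns from a CSV text source and from a relational source. The function names, the sample data and the `customers` table are hypothetical, and only Python's standard csv, io and sqlite3 modules are used.

```python
import csv
import io
import sqlite3


def extract_from_csv(text, wanted_columns):
    """Read CSV text and keep only the desired subset of columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: row[col] for col in wanted_columns} for row in reader]


def extract_from_db(conn, query):
    """Run a query against a source database and return rows as dicts."""
    cur = conn.execute(query)
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


# Two hypothetical sources with different characteristics.
csv_source = "id,name,notes\n1,Asha,vip\n2,Bikram,\n"
db_source = sqlite3.connect(":memory:")
db_source.execute("CREATE TABLE customers (id INTEGER, name TEXT, phone TEXT)")
db_source.execute("INSERT INTO customers VALUES (3, 'Chandra', '555-0100')")

extracted = extract_from_csv(csv_source, ["id", "name"]) + extract_from_db(
    db_source, "SELECT id, name FROM customers"
)
```

Note that the CSV source yields `id` values as strings while the database yields integers; differences like this are among the per-source characteristics that the later transformation step has to reconcile.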

Data Transformation
• It is the main step where the ETL adds value.
• Transform: applies rules, or combines the data with other
data, to convert source data to the desired state.
• Converts data from the format of the operational system to the format of
the data warehouse.
• Actually changes the data and indicates whether the data can
be used for its intended purposes.
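
A transformation step could be sketched as follows. The source field names (`id`, `name`, `signup`), the target field names and the day/month/year source date format are all assumptions made for this example; the usability check is one possible reading of "whether the data can be used for its intended purposes".

```python
from datetime import datetime


def transform(row):
    """Convert one source row into the (assumed) warehouse format."""
    return {
        "customer_id": int(row["id"]),
        "name": row["name"].strip().title(),
        # Assumed source format: day/month/year; target: ISO 8601 date.
        "signup_date": datetime.strptime(row["signup"], "%d/%m/%Y").date().isoformat(),
    }


def is_usable(row):
    """Rule-based check on a transformed row: name present, positive key."""
    return bool(row["name"]) and row["customer_id"] > 0
```

For example, `transform({"id": "7", "name": "  asha RAI ", "signup": "02/07/2019"})` produces a row with an integer key, a normalized name and an ISO date, and `is_usable` then reports whether that row should be loaded.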

Data Loading
• Writes the transformed data into the target warehouse and creates
indexes.
• That is, the data is physically moved to the data warehouse.
• The loading process can be broken down into two types:
– Initial load
– Continuous load (loading over time)
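
The two load types can be sketched with sqlite3 as below. The table `dim_customer`, the index name and the rule "skip rows whose key already exists" are assumptions for illustration; real continuous loads often also update changed rows.

```python
import sqlite3


def initial_load(conn, rows):
    """Create the target table, bulk-insert all rows, then build an index."""
    conn.execute("DROP TABLE IF EXISTS dim_customer")
    conn.execute(
        "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_customer_name ON dim_customer (name)")
    conn.commit()


def continuous_load(conn, rows):
    """Add only rows whose key is not yet present in the warehouse table."""
    conn.executemany("INSERT OR IGNORE INTO dim_customer VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
initial_load(warehouse, [(1, "Asha"), (2, "Bikram")])
continuous_load(warehouse, [(2, "Bikram"), (3, "Chandra")])
```

After both calls the table holds three customers: the repeated key 2 is ignored and key 3 is added by the continuous load.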

Data warehouse processes
• There are four major processes that contribute to a data
warehouse:
– Extract and load the data.
– Cleaning and transforming the data.
– Backup and archive the data.
– Managing queries and directing them to the appropriate data sources.

Contd..
• Extract and Load Process
– Data extraction takes data from the source systems. Data load takes the
extracted data and loads it into the data warehouse.
– Before loading the data into the data warehouse, the information
extracted from the external sources must be reconstructed.

Contd..
• Clean and Transform Process:
– Once the data is extracted and loaded into the temporary
data store, it is time to perform Cleaning and Transforming.
– The steps involved in Cleaning and Transforming are:
• Clean and transform the loaded data into a structure that speeds up
queries.
• Partition the data to optimize hardware performance and
simplify the management of the data warehouse.
• Create aggregations to speed up common queries.
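
Partitioning and aggregation can be illustrated in plain Python. Partitioning by year and aggregating by (year, item) are choices made for this sketch, as are the field names; a real warehouse would normally use the partitioning and summary-table features of its database.

```python
from collections import defaultdict


def partition_by_year(rows):
    """Split rows into one partition per year."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["year"]].append(row)
    return dict(parts)


def aggregate_sales(rows):
    """Pre-compute total amount per (year, item) for common queries."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["year"], row["item"])] += row["amount"]
    return dict(totals)


# Hypothetical cleaned rows.
rows = [
    {"year": 2018, "item": "pen", "amount": 10.0},
    {"year": 2018, "item": "pen", "amount": 5.0},
    {"year": 2019, "item": "ink", "amount": 7.5},
]
partitions = partition_by_year(rows)
summary = aggregate_sales(rows)
```

A query for total pen sales in 2018 can then be answered from `summary` with a single lookup instead of a scan over all detail rows.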

Contd..
• Backup and Archive the Data
– In order to recover the data in the event of data loss,
software failure, or hardware failure, it is necessary to keep
regular backups.
– Archiving involves removing the old data from the system
in a format that allows it to be quickly restored whenever
required.

Contd..
• Query Management Process
– This process performs the following functions:
• manages the queries.
• helps speed up the execution time of queries.
• directs the queries to their most effective data sources.
• ensures that all the system sources are used in the most
effective way.
• monitors actual query profiles.

Data warehouse process managers and their functions
• Process managers are responsible for maintaining the flow of data both into
and out of the data warehouse. There are three different types of process
managers:
– Load manager
– Warehouse manager
– Query manager

Contd..
• Data Warehouse Load Manager
– Load manager performs the operations required to extract and load the
data into the database.
– The size and complexity of a load manager varies between specific
solutions from one data warehouse to another.

Contd..
Fig: Load Manager Architecture

Contd..
• The load manager performs the following functions:
– Extract data from the source system.
– Fast load the extracted data into temporary data store.
– Perform simple transformations into structure similar to the one in the
data warehouse.

Contd..
• Warehouse Manager
– The warehouse manager is responsible for the warehouse management
process. The size and complexity of a warehouse manager varies
between specific solutions.
– Warehouse Manager Architecture: A warehouse manager includes the
following:
– The controlling process
– Stored procedures.
– Backup/Recovery tool
– SQL scripts

Contd..
Fig: Warehouse manager architecture

Contd..
• Functions of Warehouse Manager: A warehouse manager performs the
following functions :
– Analyzes the data to perform consistency and referential integrity checks.
– Creates indexes, business views, partition views against the base data.
– Generates new aggregations and updates the existing aggregations.
– Generates normalizations.
– Transforms and merges the source data of the temporary store into the published
data warehouse.
– Backs up the data in the data warehouse.
– Archives the data that has reached the end of its captured life.
• Note: A warehouse manager analyzes query profiles to determine whether
the indexes and aggregations are appropriate.
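
One of the functions listed above, the referential integrity check, might be sketched as follows. The function name `find_orphans` and the sample data are hypothetical; the idea is simply to flag staged fact rows whose key has no matching dimension row before they are merged into the published warehouse.

```python
def find_orphans(fact_rows, dimension_keys, key_name):
    """Return fact rows whose foreign key has no matching dimension row."""
    return [row for row in fact_rows if row[key_name] not in dimension_keys]


# Hypothetical staged data in the temporary store.
item_keys = {1, 2}
staged_facts = [
    {"item_key": 1, "units_sold": 5},
    {"item_key": 3, "units_sold": 2},  # item 3 is missing from the dimension
]
orphans = find_orphans(staged_facts, item_keys, "item_key")
```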

Contd..
• Query Manager
– The query manager is responsible for directing the queries to suitable
tables.
– By directing the queries to appropriate tables, it speeds up the query
request and response process.
– In addition, the query manager is responsible for scheduling the
execution of the queries posted by the user.

Contd..
• Query Manager Architecture: A query manager includes the
following components:
– Query redirection via C tool or RDBMS
– Stored procedures
– Query management tool
– Query scheduling via C tool or RDBMS

Contd..
Fig: Query manager architecture

Contd..
• Functions of Query Manager
– It presents the data to the user in a form they understand.
– It schedules the execution of the queries posted by the end-user.
– It stores query profiles to allow the warehouse manager to determine
which indexes and aggregations are appropriate.
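
Query redirection and query profiling can be sketched together. The catalog of table names and columns below is an assumption for illustration: the router picks the first (smallest) table that covers the requested columns and counts how often each table is used, which is the kind of profile a warehouse manager could later inspect.

```python
from collections import Counter

# Hypothetical catalog: summary tables first, detailed base table last.
CATALOG = [
    ("sales_by_year", {"year", "amount"}),
    ("sales_by_year_item", {"year", "item", "amount"}),
    ("sales_detail", {"year", "item", "branch", "location", "amount"}),
]

query_profile = Counter()


def route(columns_needed):
    """Direct a query to the first table that covers all needed columns."""
    needed = set(columns_needed)
    for table, columns in CATALOG:
        if needed <= columns:
            query_profile[table] += 1  # record the profile for later analysis
            return table
    raise ValueError(f"no table covers columns: {sorted(needed)}")
```

A query needing only `year` and `amount` is routed to the small summary table, while one that needs `branch` falls through to the detailed table.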

Data warehouse design
• There are a number of ways of conceptualizing a data
warehouse:
– A three-level architecture (OLTP, central data warehouse, and data
marts).
– Another three-level architecture (OLTP, ODS, and data warehouse).
• Whatever the architecture, a data warehouse needs to have a
data model that can form the basis for implementing it.

Contd..
• The entity-relationship data model is commonly used in the design
of relational databases, where a database schema consists of a set of
entities and the relationships between them. Such a data model is
appropriate for on-line transaction processing.
• A data warehouse, however, requires a concise, subject-oriented
schema that facilitates on-line data analysis.
• The most popular data model for a data warehouse is a
multidimensional model. Such a model can exist in the form of a
star schema, a snowflake schema, or a fact constellation schema.

Contd..
• Star schema:
– Consists of a central fact table and a set of surrounding dimension
tables on which the facts depend.
– The central fact table contains the keys to each of the dimensions.
– The fact table also contains the attributes (measures), e.g., dollars sold
and units sold.
– Each dimension in a star schema is represented by only one
dimension table.
– This dimension table contains the set of attributes of that dimension.

Contd..
• The following diagram shows the sales data of a company with respect to
the four dimensions, namely time, item, branch, and location.
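
A star schema with these four dimensions can be sketched in sqlite3 as follows. The dimension names and the measures (dollars sold, units sold) come from the slides; the table names, the remaining column names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

-- Central fact table: one key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim VALUES (1, 2018, 'Q1'), (2, 2019, 'Q1');
INSERT INTO item_dim VALUES (1, 'Laptop', 'computer'), (2, 'Phone', 'mobile');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Kathmandu');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 1000.0, 2),
    (2, 1, 1, 1, 1500.0, 3),
    (2, 2, 1, 1, 400.0, 1);
""")


def dollars_by_year_and_type():
    """Join the fact table to two dimensions and aggregate a measure."""
    return conn.execute(
        "SELECT t.year, i.type, SUM(f.dollars_sold) "
        "FROM sales_fact f "
        "JOIN time_dim t ON f.time_key = t.time_key "
        "JOIN item_dim i ON f.item_key = i.item_key "
        "GROUP BY t.year, i.type ORDER BY t.year, i.type"
    ).fetchall()
```

Every analytical query follows the same pattern: start from the fact table, join only the dimensions needed, and aggregate the measures.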

Contd..
• Snowflake Schema:
– Star schemas may be refined into snowflake schemas if we wish to
provide support for dimension hierarchies by allowing the dimension
tables to have sub-tables to represent the hierarchies.
– For example, the item dimension table in star schema is split into two
dimension tables, namely item and supplier table as shown in figure
below.
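
The split of the item dimension into item and supplier tables might look like this in sqlite3. The column names and sample rows are assumptions for illustration; the point is that the item table now holds a key into a separate supplier sub-table instead of repeating supplier attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-table that represents one level of the item hierarchy.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);

CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);

INSERT INTO supplier_dim VALUES (10, 'wholesale'), (20, 'retail');
INSERT INTO item_dim VALUES (1, 'Laptop', 10), (2, 'Phone', 20), (3, 'Tablet', 10);
""")


def items_by_supplier_type(supplier_type):
    """Reaching a supplier attribute now needs one extra join."""
    rows = conn.execute(
        "SELECT i.item_name FROM item_dim i "
        "JOIN supplier_dim s ON i.supplier_key = s.supplier_key "
        "WHERE s.supplier_type = ? ORDER BY i.item_name",
        (supplier_type,),
    ).fetchall()
    return [name for (name,) in rows]
```

The trade-off is visible in the query: the snowflake form stores each supplier once, at the cost of an additional join compared with the star schema.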

Contd..
• Fact constellation :
– Sophisticated applications may require multiple fact tables to share
dimension tables.
– This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.

Contd..
• The following illustration shows two fact tables, namely sales and
shipping. The time, item, and location dimension tables are shared between
the sales and shipping fact tables.
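
The sharing of dimension tables can be made explicit with a small sketch that records which dimensions each fact table references. The `branch` dimension for sales comes from the star schema example earlier in the slides; the `shipper` dimension for shipping is an assumption added for illustration.

```python
# Hypothetical description of a fact constellation:
# fact table name -> set of dimension tables it references.
SCHEMA = {
    "sales": {"time", "item", "branch", "location"},
    "shipping": {"time", "item", "shipper", "location"},  # 'shipper' is assumed
}


def shared_dimensions(schema, fact_a, fact_b):
    """Return the dimension tables referenced by both fact tables."""
    return schema[fact_a] & schema[fact_b]


shared = shared_dimensions(SCHEMA, "sales", "shipping")
```

Here `shared` contains time, item and location, matching the dimensions the slide names as shared between the two fact tables.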

Data warehouse Implementation
• Whatever the data warehouse architecture, building a data
warehouse is likely to consist of the following steps:
– Requirement analysis and capacity planning
– Hardware integration
– Modeling
– Physical modeling
– Sources
– ETL
– Populate the data warehouse
– User application
– Roll-out the warehouse and applications

Contd..
• Requirement analysis and capacity planning:
– The first step involves defining the enterprise needs, defining
architecture, carrying out capacity planning, selecting the hardware and
software tools.

Contd..
• Hardware integration:
– This involves integrating the servers, the storage and the client tool.

Contd..
• Modeling:
– This is a major step that involves designing the warehouse schema and
views.

Contd..
• Physical modeling:
– This involves designing the physical warehouse organization, and
placement, partitioning, and access methods.

Contd..
• Sources:
– Identifying and connecting the sources using gateways, ODBC drivers,
or other wrappers.

Contd..
• ETL:
– Designing and implementing the ETL process. This may involve
identifying a suitable ETL tool vendor and purchasing and
implementing the tool.

Contd..
• Populate the data warehouse:
– Once the ETL process is in place, populating the warehouse according to
the schema and view definitions.

Contd..
• User application:
– This step involves designing and implementing end-user applications
since for the data warehouse to be useful there must be end-user
applications.

Contd..
• Roll-out the warehouse and applications:
– Once the data warehouse has been populated and the end-user
applications tested , the warehouse system and the applications may be
rolled out for the user community to use.

Guidelines for data warehouse implementation
• The following are general guidelines for the successful implementation of
a data warehouse. Not all of them will apply to every data
warehouse project.
– Build incrementally
– Need a champion
– Senior management support
– Ensure quality
– Corporate strategy
– Business plan
– Training
– Adaptability
– Joint management

Homework
• Why do many enterprises need a data warehouse?
• How is a data warehouse different from a database? How are they
similar?
• What is ODS and what is it used for?
• State and explain the major steps involved in the ETL process.
• What is the major difference between the star schema and the
snowflake schema?
• Explain the implementation steps of data warehouse.
• Explain the guidelines for implementing a data warehouse.

Contd..
• Describe how a data warehouse is modeled and implemented using
different schemas. Explain using an example.
• What do you mean by a business analysis framework for data warehouse
design?
• Explain in detail the data warehouse design process.
• From the software engineering point of view, what are the different steps in
the design and construction of a data warehouse?

Thank You !