2. Table of Contents:
What is Process Management?
Data Warehouse Process Architecture:
Data warehouse architecture involves the following components:
Load management
Warehouse management
Query management
The 3 Perspectives for the Process Model:
Conceptual
Logical
Physical
4. What is Process Management?
Process managers are responsible for maintaining the flow of data both into and out of the data warehouse.
There are three different types of process managers:
o Load manager
o Warehouse manager
o Query manager
5. Data Sources:
The data is extracted from operational databases or from external information providers.
Sources may be internal, external, production, or archived.
Gateways are the application programs used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at the server.
Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
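The gateway idea above can be sketched with Python's DB-API, which is what ODBC/JDBC-style drivers expose to client programs. This is a minimal sketch using the standard-library sqlite3 module as a stand-in for a real gateway connection; the table and data are illustrative.

```python
import sqlite3

# Stand-in for an ODBC/JDBC connection; a real gateway would connect
# to the operational DBMS (e.g. via an ODBC driver or a JDBC URL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 250.0), (2, 99.5)")

# The client program generates SQL; the server executes it and returns rows.
rows = conn.execute("SELECT id, amount FROM orders WHERE amount > 100").fetchall()
print(rows)  # [(1, 250.0)]
```

The key point is that the client only generates SQL text; execution happens at the server behind the standard call interface.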
8. Load Manager:
Includes all of the software and utilities required to:
Extract source system data and move it to the warehouse environment.
Complete basic transformations to ensure that nonessential data is eliminated and other data is converted to appropriate data types.
Fast-load data into a staging area where it can be subsequently manipulated.
9. EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose.
For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product.
10. EXTRACT (Cont.)
The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following:
o Which files and tables are to be accessed in the source database?
o Which fields are to be extracted from them? This is often done internally by an SQL SELECT statement.
o What are those fields to be called in the resulting database?
o What is the target machine and database format of the output?
o On what schedule should the extraction process be repeated?
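The five questions above amount to an extraction specification. A minimal sketch of such a specification and the SELECT statement a tool might generate from it internally; all names (source_table, fields, renames, target, schedule) are hypothetical, not from any particular product.

```python
# Hypothetical extraction spec mirroring the five questions a tool asks.
spec = {
    "source_table": "orders",                     # which tables to access
    "fields": ["order_id", "amount", "cust_id"],  # which fields to extract
    "renames": {"cust_id": "customer_key"},       # names in the resulting database
    "target": "warehouse.stage_orders",           # target machine/database format
    "schedule": "daily 02:00",                    # repeat schedule
}

def build_select(spec):
    """Generate the SELECT statement such a tool would run internally."""
    cols = ", ".join(
        f"{f} AS {spec['renames'][f]}" if f in spec["renames"] else f
        for f in spec["fields"]
    )
    return f"SELECT {cols} FROM {spec['source_table']}"

print(build_select(spec))
# SELECT order_id, amount, cust_id AS customer_key FROM orders
```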
11. TRANSFORM
The operational databases may have been developed under any set of priorities, which keep changing with the requirements.
This step deals with rectifying any inconsistency in the extracted data.
One of the most common transformation issues is attribute naming inconsistency: the same data element may carry different names in different source databases.
Once all the data elements have the right names, they must be converted to common formats.
12. TRANSFORM (Cont.)
The conversion may encompass the following:
Characters must be converted from ASCII to EBCDIC or vice versa.
Mixed-case text may be converted to all uppercase for consistency.
Numerical data must be converted into a common format.
Data formats have to be standardized.
Measurements may have to be converted (e.g., Rs to $).
Coded data (Male/Female, M/F) must be converted into a common format.
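A minimal sketch of a record-level transform applying some of the conversions listed above; the field names, the Rs-to-$ rate, and the code mapping are illustrative assumptions, not a standard.

```python
def transform(record):
    """Apply common-format conversions to one extracted record (illustrative rules)."""
    out = dict(record)
    out["name"] = out["name"].upper()               # mixed text -> all uppercase
    out["amount"] = round(float(out["amount"]), 2)  # numeric data -> common format
    # Coded data (Male/Female, M/F) -> one common format
    codes = {"MALE": "M", "M": "M", "FEMALE": "F", "F": "F"}
    out["gender"] = codes[out["gender"].upper()]
    # Measurement conversion, Rs -> $ (assumed exchange rate of 83 Rs/$)
    out["amount_usd"] = round(out["amount"] / 83.0, 2)
    return out

row = transform({"name": "Asha Rao", "amount": "415.00", "gender": "Female"})
print(row)  # {'name': 'ASHA RAO', 'amount': 415.0, 'gender': 'F', 'amount_usd': 5.0}
```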
13. LOADING
Loading often implies physical movement of the data from the computer(s) storing the source database(s) to the one that will store the data warehouse database, assuming the two are different.
This takes place immediately after the extraction phase.
The most common channel for data movement is a high-speed communication link.
Ex: Oracle Warehouse Builder is a tool from Oracle that provides features to perform the ETL task on an Oracle data warehouse.
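The load into a staging area can be sketched as a bulk insert. This uses the standard-library sqlite3 module as a stand-in for the warehouse DBMS; real load managers use the DBMS's bulk-load utilities instead of row inserts, and the table and rows here are illustrative.

```python
import sqlite3

# Stand-in for the warehouse database; a real fast load would use a
# bulk utility (e.g. the DBMS's native loader) over a high-speed link.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE stage_orders (order_id INTEGER, amount REAL)")

# Rows produced by the extract/transform steps.
extracted = [(1, 250.0), (2, 99.5), (3, 410.0)]
warehouse.executemany("INSERT INTO stage_orders VALUES (?, ?)", extracted)
warehouse.commit()

count = warehouse.execute("SELECT COUNT(*) FROM stage_orders").fetchone()[0]
print(count)  # 3
```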
15. Warehouse Manager
The warehouse manager performs all the operations associated with the management of data in the warehouse.
It is constructed using vendor data-management tools and custom-built programs.
17. Detailed Data
Stores all the detailed data in the database schema.
In most cases, the detailed data is not stored online but aggregated to
the next level of detail.
On a regular basis, detailed data is added to the warehouse to
supplement the aggregated data.
18. Lightly and Highly Summarized Data
Stores all the predefined lightly and highly aggregated data generated by the warehouse manager.
This data is transient, as it is subject to change on an ongoing basis in order to respond to changing query profiles.
The purpose of summary information is to speed up the performance of queries.
Removes the requirement to continually perform summary operations (such as
sort or group by) in answering user queries.
The summary data is updated continuously as new data is loaded into the
warehouse.
19. Archive / Backup Data
Stores detailed and summarized data for the purposes of archiving
and backup.
May be necessary to backup online summary data if this data is kept
beyond the retention period for detailed data.
The data is transferred to storage archives such as magnetic tape or
optical disk.
21. Functions of Warehouse Manager
Analyzes the data to perform consistency and referential integrity checks.
Creates indexes, business views, and partition views against the base data.
Generates new aggregations and updates the existing aggregations.
Generates normalizations.
Transforms and merges the source data into the temporary store of the published data warehouse.
22. Functions of Warehouse Manager (Cont.)
Backs up the data in the data warehouse.
Archives the data that has reached the end of its captured life.
Note:
A warehouse manager analyses query profiles to determine whether
the index and aggregations are appropriate.
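Two of the warehouse manager's functions above, creating indexes and generating aggregations, can be sketched in SQL. A minimal sketch using the standard-library sqlite3 module as a stand-in for the warehouse DBMS; the sales table and region aggregation are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("north", 50.0), ("south", 80.0)])

# Generate an aggregation: materialize the summary so user queries
# need not repeat the GROUP BY over the detailed data.
conn.execute("""CREATE TABLE sales_by_region AS
                SELECT region, SUM(amount) AS total
                FROM sales GROUP BY region""")

# Create an index against the base data.
conn.execute("CREATE INDEX idx_sales_region ON sales(region)")

totals = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(totals)  # [('north', 150.0), ('south', 80.0)]
```

When new data is loaded, the manager would refresh sales_by_region rather than leave it stale.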
24. Query Manager
Responsible for directing the queries to the suitable tables.
By directing queries to the appropriate tables, the speed of querying and response generation can be increased.
Also responsible for scheduling the execution of the queries posed by the user.
26. Query Manager Architecture
Query redirection via C tool or RDBMS
Stored procedures
Query management tool
Query scheduling via C tool or RDBMS
Query scheduling via third-party software
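Query redirection can be sketched as a lookup that routes aggregate queries to a summary table and everything else to the detailed table. This is a deliberately minimal sketch; the table names and the (table, group_by) routing key are illustrative assumptions, not how any particular RDBMS tool implements it.

```python
# Hypothetical mapping from (base table, grouping column) to the
# pre-built summary table that can answer the query faster.
SUMMARY_TABLES = {("sales", "region"): "sales_by_region"}

def redirect(table, group_by=None):
    """Return the table a user query should actually run against."""
    return SUMMARY_TABLES.get((table, group_by), table)

print(redirect("sales", group_by="region"))  # sales_by_region
print(redirect("sales"))                     # sales
```

Routing aggregate queries to the summary table is what makes the speed-up above possible without the user changing their SQL.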
27. Query Manager Functionality
Performs all operations associated with the management of user queries.
This component is usually constructed using:
vendor end-user access tools,
data warehousing monitoring tools,
database facilities, and
custom-built programs.
The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database.
28. Detailed Information
Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape.
This part of the data warehouse keeps the detailed information in the starflake schema.
Detailed information is loaded into the data warehouse to supplement the aggregated data.
30. Summary Information
This area of the data warehouse keeps the predefined aggregations.
These aggregations are generated by the warehouse manager.
This area changes on an ongoing basis in order to respond to changing query profiles.
Summary information speeds up the performance of common queries, but increases the operational cost.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed information.
31. Functions of Query Manager
Presents the data to the user in a form they understand.
Schedules the execution of the queries posed by the end user.
Stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
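The query-profile function above can be sketched as a counter over the queries users actually run, from which the warehouse manager picks candidate aggregations. A minimal sketch; the (table, group_by) profile key and the example queries are illustrative assumptions.

```python
from collections import Counter

# Hypothetical profile store: count which (table, grouping column) pairs
# users query, so the warehouse manager can decide which aggregations
# and indexes are worth building.
profile = Counter()

def record_query(table, group_by=None):
    profile[(table, group_by)] += 1

# Simulated query workload.
for _ in range(5):
    record_query("sales", "region")
record_query("sales")

# The most frequent profile suggests a candidate aggregation.
(table, group_by), hits = profile.most_common(1)[0]
print(table, group_by, hits)  # sales region 5
```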
32. 3 Perspectives for the Process Model
Conceptual perspective: why these steps exist
Logical perspective: what steps it consists of
Physical perspective: how they are to be performed
33. 3 Perspectives for the Process Model
A conceptual perspective, which abstractly represents the basic interrelationships between data warehouse stakeholders and processes in a formal way.
A central logical perspective part of the model, which captures the basic structure and data characteristics of a process.
A physical perspective counterpart, which provides specific details of the actual components that execute the process.
35. Conceptual Perspective
Major purpose: to help stakeholders understand the reasoning behind decisions on the architecture and physical characteristics of data warehouse processes.
36. Concept
Each Type in the logical perspective is the counterpart of a Concept in the conceptual perspective.
A Concept represents a class of real-world objects, in terms of a conceptual metamodel such as the Entity-Relationship model or UML notation.
Both Types and Concepts are constructed from Fields, through the attribute fields.
Consider Field to be a subtype of both LogicalObject and ConceptualObject.
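The Type/Concept/Field relationships above can be sketched with plain dataclasses. This is only an illustrative rendering of the metamodel's structure, not the formal notation the model itself uses; the example Customer objects are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    """Subtype of both LogicalObject and ConceptualObject in the metamodel."""
    name: str

@dataclass
class Concept:
    """Conceptual perspective: a class of real-world objects."""
    name: str
    fields: list = field(default_factory=list)  # constructed from Fields

@dataclass
class Type:
    """Logical perspective: the counterpart of a Concept."""
    name: str
    concept: Concept                            # counterpart link
    fields: list = field(default_factory=list)  # constructed from Fields

# Illustrative instances.
customer = Concept("Customer", [Field("id"), Field("name")])
customer_type = Type("CustomerRecord", customer, customer.fields)
print(customer_type.concept.name)  # Customer
```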
37. ROLE
Role is the central conceptual entity.
It generalizes the conceptual counterparts of activities, stakeholders, and data stores.
The class Role is used to express the interdependencies of these entities, through the attribute RelatesTo. Activity Role, Stakeholder Role, and Concept are specializations of Role for processes, persons, and concepts in the conceptual perspective.
Each Role represents a person, program, or data store participating in the environment of a process.
38. LOGICAL PERSPECTIVE
Captures the basic structure and data characteristics of a
process.
In the logical perspective, the modeling is concerned with the
functionality of an activity, describing what this particular
activity is about in terms of consumption and production of
information.
40. Physical Perspective
While the logical perspective covers the structure (what?) of a process, the physical perspective covers the details of its execution (how?).
The physical perspective counterpart provides specific details of the actual components that execute the process.
The information of the physical perspective can be used to trace and monitor the execution of data warehouse processes.
41. Summary
Process managers are responsible for maintaining the flow of data.
Load manager performs the operations required to extract and load
the data into the database.
The warehouse manager is responsible for the warehouse
management process.
The query manager is responsible for directing the queries to suitable
tables.
3 Perspectives for the Process Model
Logical perspective
Physical perspective
Conceptual perspective
42. References
Panos Vassiliadis, Christoph Quix, Yannis Vassiliou, Matthias Jarke: "Data Warehouse Process Management." National Technical University of Athens, Dept. of Electrical and Computer Eng., Computer Science Division, Iroon Polytechniou 9, 157 73, Athens, Greece. {pvassil,yv}@dbnet.ece.ntua.gr
www.tutorialspoint.com/dwh