Data warehousing
Introduction
• Data warehousing is a process for assembling and managing
data from various sources for the purpose of gaining a single
detailed view of an enterprise.
• A data warehouse is a subject-oriented, integrated, time-variant
and non-volatile repository of information in support of
management’s decision-making process. - Bill Inmon (1992)
7/2/2019 2Compiled By: Kamal Acharya
• The benefits of implementing a data warehouse are as follows:
– To provide a single version of truth about enterprise information.
– To speed up queries involving aggregations across multiple attributes.
– To provide a system in which managers who do not have a strong
technical background are able to run complex queries.
– To provide a database that stores relatively clean data.
– To provide a database that stores historical data.
• Difference between OLTP and data warehouse systems:
Property | OLTP | Data warehouse
Nature of database | 2D tables | Multidimensional (cube)
Indexes | Few | Many
Joins | Many | Few
Duplicated data | Normalized data | Denormalized data
Derived data and aggregates | Rare | Common
Queries | Mostly pre-defined and simple | Mostly ad-hoc and complex
Updates | All the time | Only refreshed
Historical data | Often not available | Essential
Operational data stores
• An operational data store (ODS) is a subject-oriented, integrated,
volatile, current-valued data store, containing only corporate
detailed data.
• It is designed to provide a consolidated view of the enterprise’s
current operational information.
• The ODS is subject-oriented:
– That is, it is organized around the major data subjects of an enterprise
– E.g., in a university, the subjects might be students, lecturers and
courses while in a company subjects might be customers, sales and
products.
• The ODS is integrated:
– That is, it is a collection of subject-oriented data from a variety of
systems to provide an enterprise-wide view of the data.
• The ODS is current-valued:
– That is, an ODS is up-to-date and reflects the current status of the
information.
– An ODS does not include historical data.
• The ODS is volatile:
– That is, the data in the ODS changes frequently as new information
refreshes the ODS.
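The contrast between a volatile, current-valued store and a non-volatile, time-variant one can be sketched in a few lines of Python. This is only an illustration: the names `ods`, `warehouse` and `refresh`, and the customer data, are hypothetical and not part of the slides.

```python
from datetime import date

# Hypothetical stores (assumed for illustration):
# the ODS keeps one current row per key, the warehouse keeps every dated version.
ods = {}          # customer_id -> current record (volatile, current-valued)
warehouse = []    # append-only history (non-volatile, time-variant)

def refresh(customer_id, city, as_of):
    """Apply one update from a source system to both stores."""
    # The ODS overwrites the previous value, so no history survives.
    ods[customer_id] = {"city": city, "as_of": as_of}
    # The warehouse appends a new dated row and never changes old ones.
    warehouse.append({"customer_id": customer_id, "city": city, "as_of": as_of})

refresh(101, "Kathmandu", date(2019, 1, 5))
refresh(101, "Pokhara", date(2019, 6, 20))

print(ods[101]["city"])   # only the current value is available
print(len(warehouse))     # both versions are retained
```

After the two refreshes the ODS holds only the latest city for customer 101, while the warehouse list still contains both dated rows.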
• The ODS is detailed:
– That is, the ODS is detailed enough to serve the needs of the
operational management staff in the enterprise.
• An ODS may also be used as an interim database for a data
warehouse.
• An ODS may be viewed as an enterprise’s short term memory in
that it stores only very recent information.
• Benefits:
– Improves access to important operational data.
– Assists in better understanding of the business and the customer.
– Generates current reports more efficiently, without having to access the
OLTP systems.
– Shortens the time required to implement a data warehouse.
Fig: Typical architecture of operational data stores (diagram not reproduced; its data flows are labeled ETL)
• Why a separate ODS?
– It is possible to run ODS queries on the existing OLTP systems, but
OLTP systems must respond quickly to operational users, and the
business cannot afford to have response time suffer while a manager
runs a complex query.
– From time to time, complex queries degrade the performance of the
OLTP systems.
– So, to make query processing more efficient without loading the OLTP
systems, a separate ODS is required.
• Comparison of the ODS and data warehouse:
ODS | DW
Data of high quality at detailed level and assured availability | Data may not be perfect, but sufficient for strategic analysts; data does not have to be highly available
Contains current and near-current data | Contains historical data
Real-time and near-real-time data loads | Normally batch data loads
Typically detailed data only | Contains summarized and detailed data
Modeled to support rapid data updates | Modeled to optimize query performance
Transactions similar to those in OLTP systems | Complex queries processing larger volumes of data
Used for operational reporting | Used for management reporting
• Relationship between OLTP, ODS and DW systems:
Fig: OLTP, ODS and DW systems (diagram not reproduced)
Multi-Tiered Architecture
Fig: Multi-tiered architecture (diagram not reproduced). Data sources (operational DBs, other sources) are extracted, transformed, loaded and refreshed into data storage (data warehouse, data marts, metadata, monitor & integrator), which serves the OLAP engine (OLAP server), which in turn feeds the front-end tools (analysis, query, reports, data mining).
ETL (Extract, Transform, Load)
• ETL, short for extract, transform, and load, names three database
functions.
• ETL is used to migrate data from one database to another, to
form data marts and data warehouses and also to convert
databases from one format or type to another.
• To get data out of the source and load it into the data
warehouse:
– Data is extracted from an OLTP database, transformed to match the
data warehouse schema and loaded into the data warehouse database
– Many data warehouses also incorporate data from non-OLTP systems
such as text files, legacy systems, and spreadsheets; such data also
requires extraction, transformation, and loading
Fig: ETL process
Data Extraction
• Extract: reads data from a specified source and extracts a desired
subset of data.
• Data is extracted from heterogeneous data sources.
• Each data source has its own distinct set of characteristics that need to
be managed and integrated into the ETL system in order to
extract data effectively.
Data Transformation
• It is the main step where the ETL adds value.
• Transform: uses rules, or combinations with other data, to
convert source data to the desired state.
• Converts data from the format of the operational system to the
format of the data warehouse.
• Actually changes data and provides guidance on whether data can
be used for its intended purposes.
Data Loading
• Writes transformed data into the target warehouse and creates
indexes.
• That is, data is physically moved to the data warehouse.
• The loading process can be broken down into 2 different types:
• Initial Load
• Continuous Load (loading over time)
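The three steps described above can be sketched end to end as a small Python script. This is a minimal illustration under stated assumptions, not a production ETL tool: the CSV layout, the day/month/year date format, the `sales_fact` table and the cleaning rule (reject rows with no amount) are all hypothetical. It uses only the standard `csv`, `io` and `sqlite3` modules.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export from an operational system (assumed layout).
SOURCE_CSV = """order_id,order_date,amount,country
1,02/07/2019,100.50,np
2,03/07/2019,,np
3,03/07/2019,20.00,in
"""

def extract(text):
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: apply rules to convert source rows to the warehouse format."""
    out = []
    for r in rows:
        if not r["amount"]:
            continue  # assumed rule: reject rows with a missing amount
        day, month, year = r["order_date"].split("/")  # assumed dd/mm/yyyy
        out.append((int(r["order_id"]),
                    f"{year}-{month}-{day}",   # ISO date in the warehouse
                    float(r["amount"]),
                    r["country"].upper()))
    return out

def load(conn, rows):
    """Load: write transformed rows into the target table and create an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(order_id INTEGER PRIMARY KEY, order_date TEXT, "
                 "amount REAL, country TEXT)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_date "
                 "ON sales_fact (order_date)")
    conn.commit()

conn = sqlite3.connect(":memory:")   # in-memory stand-in for the warehouse
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM sales_fact ORDER BY order_id").fetchall()
print(loaded)
```

Running `load` once on an empty table corresponds to an initial load; calling it again later with new rows would be one simple form of loading over time.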
Data warehouse processes
• There are four major processes that contribute to a data
warehouse:
– Extracting and loading the data.
– Cleaning and transforming the data.
– Backing up and archiving the data.
– Managing queries and directing them to the appropriate data sources.
• Extract and Load Process
– Data extraction takes data from the source systems. Data load takes the
extracted data and loads it into the data warehouse.
– Before loading the data into the data warehouse, the information
extracted from the external sources must be reconstructed.
• Clean and Transform Process:
– Once the data is extracted and loaded into the temporary
data store, it is time to perform Cleaning and Transforming.
– The steps involved in cleaning and transforming are:
• Clean and transform the loaded data into a structure that speeds up
queries.
• Partition the data to optimize hardware performance and
simplify the management of the data warehouse.
• Aggregate the data to speed up common queries.
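As a rough illustration of the partitioning and aggregation steps, the sketch below splits cleaned rows into monthly partitions and pre-computes a total per month and item. The row layout and the choice of month as the partition key are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical cleaned rows: (sale_date, item, amount).
rows = [
    ("2019-06-28", "pen", 10.0),
    ("2019-06-30", "ink", 5.0),
    ("2019-07-01", "pen", 7.5),
    ("2019-07-02", "pen", 2.5),
]

# Partition by month (the "YYYY-MM" prefix of the date) so that each
# partition can be loaded, backed up or archived on its own.
partitions = defaultdict(list)
for sale_date, item, amount in rows:
    partitions[sale_date[:7]].append((sale_date, item, amount))

# Pre-compute an aggregate: total amount per (month, item), so that a
# common query can read one small summary instead of scanning detail rows.
monthly_totals = defaultdict(float)
for month, part in partitions.items():
    for _, item, amount in part:
        monthly_totals[(month, item)] += amount

print(sorted(partitions))
print(monthly_totals[("2019-07", "pen")])
```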
• Backup and Archive the Data
– To recover the data in the event of data loss,
software failure, or hardware failure, it is necessary to keep
regular backups.
– Archiving involves removing old data from the system
in a format that allows it to be quickly restored whenever
required.
• Query Management Process
– This process performs the following functions:
• manages the queries.
• helps speed up the execution time of queries.
• directs the queries to their most effective data sources.
• ensures that all the system sources are used in the most
effective way.
• monitors actual query profiles.
Data warehouse process managers and their functions
• Process managers are responsible for maintaining the flow of data both into
and out of the data warehouse. There are three different types of process
managers:
– Load manager
– Warehouse manager
– Query manager
• Data Warehouse Load Manager
– Load manager performs the operations required to extract and load the
data into the database.
– The size and complexity of a load manager varies between specific
solutions from one data warehouse to another.
Fig: Load Manager Architecture
• The load manager performs the following functions:
– Extract data from the source system.
– Fast load the extracted data into temporary data store.
– Perform simple transformations into structure similar to the one in the
data warehouse.
• Warehouse Manager
– The warehouse manager is responsible for the warehouse management
process. The size and complexity of a warehouse manager varies
between specific solutions.
– Warehouse Manager Architecture: A warehouse manager includes the
following:
• The controlling process
• Stored procedures
• Backup/recovery tool
• SQL scripts
Fig: Warehouse manager architecture
• Functions of Warehouse Manager: A warehouse manager performs the
following functions :
– Analyzes the data to perform consistency and referential integrity checks.
– Creates indexes, business views, partition views against the base data.
– Generates new aggregations and updates the existing aggregations.
– Generates normalizations.
– Transforms and merges the source data of the temporary store into the published
data warehouse.
– Backs up the data in the data warehouse.
– Archives the data that has reached the end of its captured life.
• Note: A warehouse manager analyzes query profiles to determine whether
the indexes and aggregations are appropriate.
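The first function in the list, consistency and referential integrity checking, might look roughly like the following before staged rows are merged into the published warehouse. The staged data, the `check` helper and the two rules are hypothetical examples, not a description of any particular product.

```python
# Hypothetical staged data: known dimension keys and fact rows awaiting publication.
item_keys = {1, 2, 3}
staged_facts = [
    {"item_key": 1, "units_sold": 10},
    {"item_key": 4, "units_sold": 2},   # no matching dimension row
    {"item_key": 2, "units_sold": -5},  # violates an assumed consistency rule
]

def check(facts, valid_keys):
    """Split staged rows into publishable rows and rejected rows with a reason."""
    ok, rejected = [], []
    for row in facts:
        if row["item_key"] not in valid_keys:
            # Referential integrity: every fact must point at a known dimension row.
            rejected.append((row, "unknown item_key"))
        elif row["units_sold"] < 0:
            # Consistency: assumed business rule that quantities are non-negative.
            rejected.append((row, "negative units_sold"))
        else:
            ok.append(row)
    return ok, rejected

publishable, rejected = check(staged_facts, item_keys)
print(len(publishable), [reason for _, reason in rejected])
```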
• Query Manager
– The query manager is responsible for directing the queries to suitable
tables.
– By directing the queries to appropriate tables, it speeds up the query
request and response process.
– In addition, the query manager is responsible for scheduling the
execution of the queries posted by the user.
• Query Manager Architecture: A query manager includes the
following components:
– Query redirection via C tool or RDBMS
– Stored procedures
– Query management tool
– Query scheduling via C tool or RDBMS
Fig: Query manager architecture
• Functions of Query Manager
– It presents the data to the user in a form they understand.
– It schedules the execution of the queries posted by the end-user.
– It stores query profiles to allow the warehouse manager to determine
which indexes and aggregations are appropriate.
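Query redirection and query profiling can be illustrated with a very small router. The sketch assumes a hypothetical catalog of summary tables (`AGGREGATES`) and a base table name (`sales_fact`); a real query manager would work on parsed SQL rather than a list of grouping columns.

```python
from collections import Counter

# Hypothetical catalog: which summary tables exist and what they are grouped by.
AGGREGATES = {
    frozenset({"month"}): "sales_by_month",
    frozenset({"month", "item"}): "sales_by_month_item",
}
BASE_TABLE = "sales_fact"

# Query profiles: how often each grouping has been requested.
query_profiles = Counter()

def route(group_by):
    """Record the query profile and return the table the query should run on."""
    key = frozenset(group_by)
    query_profiles[key] += 1
    # Use a matching summary table if one exists, else fall back to the base table.
    return AGGREGATES.get(key, BASE_TABLE)

print(route(["month"]))
print(route(["branch"]))
print(route(["branch"]))
```

A grouping that keeps falling through to the base table, such as `branch` here, shows up with a high count in `query_profiles`, which is the kind of evidence the warehouse manager can use when deciding on new aggregations.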
Data warehouse design
• There are a number of ways of conceptualizing a data
warehouse:
– A three-level architecture (OLTP, central data warehouse, and data
marts).
– Another three-level architecture (OLTP, ODS, and data warehouse).
• Whatever the architecture, a data warehouse needs to have a
data model that can form the basis for implementing it.
• The entity-relationship data model is commonly used in the design
of relational databases, where a database schema consists of a set of
entities and the relationships between them. Such a data model is
appropriate for on-line transaction processing.
• A data warehouse, however, requires a concise, subject-oriented
schema that facilitates on-line data analysis.
• The most popular data model for a data warehouse is a
multidimensional model. Such a model can exist in the form of a
star schema, a snowflake schema, or a fact constellation schema.
• Star schema:
– Consists of a central fact table and a set of surrounding dimension
tables on which the facts depend.
– The central fact table contains the keys to each dimension.
– The fact table also contains the attributes (measures), e.g., dollars sold
and units sold.
– Each dimension in a star schema is represented by only one
dimension table.
– This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to
the four dimensions, namely time, item, branch, and location.
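A star schema with these four dimensions can be sketched with Python's built-in `sqlite3` module. The table and column names (`time_dim`, `item_dim`, `branch_dim`, `location_dim`, `sales_fact`) and the sample rows are assumptions chosen for the example; only the dimensions and the two measures come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension (hypothetical columns).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

-- Central fact table: one key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim VALUES (1, 2019, 'Q2'), (2, 2019, 'Q3');
INSERT INTO item_dim VALUES (1, 'pen', 'wholesale'), (2, 'ink', 'retail');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Kathmandu');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 100.0, 10),
                              (2, 1, 1, 1,  50.0,  5),
                              (2, 2, 1, 1,  30.0,  3);
""")

# Dollars sold per quarter and item: one join per dimension that is used.
result = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN item_dim i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(result)
```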
• Snowflake Schema:
– Star schemas may be refined into snowflake schemas if we wish to
provide support for dimension hierarchies by allowing the dimension
tables to have sub-tables to represent the hierarchies.
– For example, the item dimension table in the star schema is split into two
dimension tables, namely an item table and a supplier table, as shown in
the figure below.
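The item and supplier split can be sketched with `sqlite3` as follows. The column names and sample rows are hypothetical; the point is only that the supplier attributes move to their own table and the item table keeps a key to it, so a query by supplier type needs one extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-table holding the supplier level of the item hierarchy (assumed columns).
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);

-- The item dimension now references the supplier table instead of
-- repeating the supplier attributes on every item row.
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);

CREATE TABLE sales_fact (
    item_key   INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER
);

INSERT INTO supplier_dim VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item_dim VALUES (1, 'pen', 1), (2, 'ink', 2), (3, 'pad', 1);
INSERT INTO sales_fact VALUES (1, 10), (2, 3), (3, 4);
""")

# Units sold per supplier type: fact -> item -> supplier.
units_by_supplier_type = conn.execute("""
    SELECT s.supplier_type, SUM(f.units_sold)
    FROM sales_fact f
    JOIN item_dim i     ON i.item_key = f.item_key
    JOIN supplier_dim s ON s.supplier_key = i.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
print(units_by_supplier_type)
```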
• Fact constellation:
– Sophisticated applications may require multiple fact tables to share
dimension tables.
– This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.
• The following illustration shows two fact tables, namely sales and
shipping. The time, item, and location dimension tables are shared between
the sales and shipping fact tables.
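A fact constellation with sales and shipping fact tables sharing dimension tables can be sketched as below, again with `sqlite3`. All column names and rows are hypothetical. Each fact table is summed separately per item so that joining the two facts does not multiply rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables (assumed columns).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

-- Two fact tables that reference the same dimensions.
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER, location_key INTEGER, units_sold INTEGER
);
CREATE TABLE shipping_fact (
    time_key INTEGER, item_key INTEGER, from_location INTEGER, units_shipped INTEGER
);

INSERT INTO time_dim VALUES (1, '2019-06'), (2, '2019-07');
INSERT INTO item_dim VALUES (1, 'pen'), (2, 'ink');
INSERT INTO location_dim VALUES (1, 'Kathmandu');
INSERT INTO sales_fact VALUES (1, 1, 1, 10), (1, 2, 1, 3), (2, 1, 1, 5);
INSERT INTO shipping_fact VALUES (1, 1, 1, 12), (2, 2, 1, 4);
""")

# Compare units sold and units shipped per item through the shared item dimension.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name,
           (SELECT COALESCE(SUM(s.units_sold), 0)
              FROM sales_fact s WHERE s.item_key = i.item_key),
           (SELECT COALESCE(SUM(sh.units_shipped), 0)
              FROM shipping_fact sh WHERE sh.item_key = i.item_key)
    FROM item_dim i
    ORDER BY i.item_name
""").fetchall()
print(sold_vs_shipped)
```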
Data warehouse Implementation
• Whatever the data warehouse architecture, building a data
warehouse is likely to consist of the following steps:
– Requirement analysis and capacity planning
– Hardware integration
– Modeling
– Physical modeling
– Sources
– ETL
– Populate the data warehouse
– User application
– Roll-out the warehouse and applications
• Requirement analysis and capacity planning:
– The first step involves defining the enterprise needs, defining the
architecture, carrying out capacity planning, and selecting the hardware
and software tools.
• Hardware integration:
– This involves integrating the servers, the storage and the client tool.
• Modeling:
– This is a major step that involves designing the warehouse schema and
views.
• Physical modeling:
– This involves designing the physical warehouse organization, data
placement, data partitioning, and access methods.
• Sources:
– Identifying and connecting the sources using gateways, ODBC drivers,
or other wrappers.
• ETL:
– Designing and implementing the ETL process. This may involve
identifying a suitable ETL tool vendor and purchasing and
implementing the tool.
• Populate the data warehouse:
– After ETL, populating the warehouse with the schema and view
definitions.
• User application:
– This step involves designing and implementing end-user applications
since for the data warehouse to be useful there must be end-user
applications.
• Roll-out the warehouse and applications:
– Once the data warehouse has been populated and the end-user
applications tested, the warehouse system and the applications may be
rolled out for the user community to use.
Guidelines for data warehouse implementation
• The following are general guidelines for the successful implementation of
a data warehouse; not all of them will apply to every data
warehouse project.
– Build incrementally
– Need a champion
– Senior management support
– Ensure quality
– Corporate strategy
– Business plan
– Training
– Adaptability
– Joint management
Homework
• Why do many enterprises need a data warehouse?
• How is data warehouse different from a database? How are they
similar?
• What is ODS and what is it used for?
• State and explain the major steps involved in the ETL process.
• What is the major difference between the star schema and the
snowflake schema?
• Explain the implementation steps of a data warehouse.
• Explain the guidelines for implementing a data warehouse.
• Describe how a data warehouse is modeled and implemented using
different schemas. Explain using an example.
• What do you mean by a business analysis framework for data warehouse
design?
• Explain in detail the data warehouse design process.
• From the software engineering point of view, what are the different steps in
the design and construction of a data warehouse?
Thank You !