1
Data warehousing and
Data Mining
Dr.R.U.Anitha
2
Chapter # 2
Data Warehouse Architecture & Models
Text Book: Modern Data Warehousing, Mining,
and Visualization: Core Concepts
by George M. Marakas
3
 Learning Objectives
 The main objective of this chapter is to let the
students know the basic structure of a Data
warehouse.
 Learn about Data warehouse physical
architecture and various principles of a Data
warehousing.
 To make students understand a data warehouse
model.
 It also explains various types of
multidimensional models and Schemas.
4
 Major Topics
 Structure of a Data Warehouse
 Data Warehouse Architectures
 Three Data Warehouse Models
 Extraction, Transformation and Loading (ETL)
 Design of Data Warehouse
 Multidimensional Data Model
 Data cube
 Conceptual Modeling of Data Warehouses
Dimension Schema
5
Structure of a Data Warehouse
 Essentially, a data warehouse provides historical
data for decision-support applications.
 Such applications include reporting, online
analytical processing (OLAP), executive
information systems (EIS), and data mining.
 Data warehouse is a centralized, integrated
repository of information.
6
 Here, integrated means cleaned up, merged, and
redesigned. This may be more or less
complicated
 Depending on how many systems feed into a
warehouse and how widely they differ in
handling similar information.
 But most companies already have repositories of
information in their production systems and
many of them are centralized.
7
Data Warehouse Architectures: Conceptual
View
Single-layer
–Every data element is stored once only
–Virtual warehouse
8
Two-layer
 Real-time + derived data
–Most
commonly used approach in industry today
9
Three-layer Architecture: Conceptual View
 Transformation of real-time data to derived data
really requires two steps
10
The architecture consists of various
interconnected elements:
 Operational and external database layer –
the source data for the DW
 Information access layer – the tools the end
user access to extract and analyze the data
 Data access layer – the interface between the
operational and information access layers
 Metadata layer – the data directory or
repository of metadata information
11
Additional layers are:
 Process management layer – the scheduler or
job controller
 Application messaging layer – the
“middleware” that transports information
 around the firm
 Physical data warehouse layer – where the
actual data used in the DSS are located
 Data staging layer – all of the processes
necessary to select, edit, summarize and load
warehouse data from the operational and
external data bases
12
Components of the Data Warehouse
Architecture
13
Data Warehouse Architecture
14
Three Data Warehouse Models
 Enterprise Warehouse- Collects all of the
information about subjects spanning the entire
organization.
 The Central data warehouse- A single physical
database contains all of the data for a specific
functional area.
 The distributed data warehouse- The
components are distributed across several
physical databases.
15
Extraction, Transformation and Loading (ETL)
 Data extraction
- get data from multiple, heterogeneous, and external
sources
 Data cleaning
–detect errors in the data and rectify them when possible
 Data transformation
–convert data from legacy or host format to warehouse
format
 Loading
–sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
 Refresh
–propagate the updates from the data sources to the
warehouse
16
Design of Data Warehouse: A Business Analysis
Framework
Four views regarding the design of a data warehouse
 Top-down view
It allows selection of the relevant information necessary for
the data warehouse
 Data source view
It exposes the information being captured, stored, and
managed by operational systems
 Data warehouse view
It consists of fact tables and dimension tables
 Business query view
It shows the perspectives of data in the warehouse from the
view of end-user.
17
Multidimensional Data Model
From Tables and Spreadsheets to Data Cubes
 A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
Dimension tables[Eg. item (item_name, brand, type), or
time(day, week, month, quarter, year) ]
Fact table contains measures (such as dollars_sold) and
keys to each of the related dimension tables
 In data warehousing literature, an n-D base cube is
called a base cuboid. The top most 0-D cuboid, which
holds the highest-level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.
18
19
Data Cube
 n-dimensional cube C[A1,A2,…An] is a database
with n dimensions as A1,A2…An each of which
represents a theme and contains |Ai| number of
distinct elements in the dimension Ai.
 Each distinct element of Ai corresponds to a data
row of C.
 A data cell in the cube C[a1,a2,..an] stores the
numeric measures of the data for Ai=ai, for all i.
 Eg. Data cube C[place, product, year]
 Data Cell [Peshawar, TV, 4Qtr] stores 314 as
numeric measure which represents three
dimensions with 3,1 and 4rows respectively.
20
 A Sample Data Cube
21
22
Typical Data cube OLAP Operations
 Roll up (drill-up): Summarize data by climbing
up hierarchy or by dimension reduction
 Drill down(roll down): Reverse of roll-up
 From higher level summary to lower level
summary or detailed data, or introducing new
dimensions
 Slice and Dice: Project and select
 Pivot (rotate): reorient the cube visualization,
3D to series of 2D planes
 Drill across: involving (across) more than one
fact table
 Drill through: through the bottom level of the
cube to its back-end relational tables (using SQL)
23
24
Conceptual Modeling of Data Warehouses
Dimension Schema
 Modeling data warehouses: dimensions &
measures
 Starschema: A fact table in the middle
connected to a set of dimension tables
 Snowflakeschema: A refinement of star schema
where some dimensional hierarchy is
normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
 Factconstellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact
constellation
25
Star Schema
 Easy for users to understand
 Fast response to queries
 Simple metadata
 Supported by many front end tools
 Less robust to change
 Slower to build
 Does not support history
26
Example of STAR Schema
27
Snowflake Schema
 Direct use by some tools
 More flexible to change
 Provides for speedier data loading
 May become large and unmanageable
 Degrades query performance
 More complex metadata
28
29
Fact Constellation
 Measure of online analytical processing
 Collection of multiple fact tables
 Sharing dimension tables
 Viewed as a collection of stars.
 Improvement over Star schema.
30

CHAPTER 2 - Datawarehouse Architecture.pptx

  • 1.
    1 Data warehousing and DataMining Dr.R.U.Anitha
  • 2.
    2 Chapter # 2 DataWarehouse Architecture & Models Text Book: Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas
  • 3.
    3  Learning Objectives The main objective of this chapter is to let the students know the basic structure of a Data warehouse.  Learn about Data warehouse physical architecture and various principles of a Data warehousing.  To make students understand a data warehouse model.  It also explains various types of multidimensional models and Schemas.
  • 4.
    4  Major Topics Structure of a Data Warehouse  Data Warehouse Architectures  Three Data Warehouse Models  Extraction, Transformation and Loading (ETL)  Design of Data Warehouse  Multidimensional Data Model  Data cube  Conceptual Modeling of Data Warehouses Dimension Schema
  • 5.
    5 Structure of aData Warehouse  Essentially, a data warehouse provides historical data for decision-support applications.  Such applications include reporting, online analytical processing (OLAP), executive information systems (EIS), and data mining.  Data warehouse is a centralized, integrated repository of information.
  • 6.
    6  Here, integratedmeans cleaned up, merged, and redesigned. This may be more or less complicated  Depending on how many systems feed into a warehouse and how widely they differ in handling similar information.  But most companies already have repositories of information in their production systems and many of them are centralized.
  • 7.
    7 Data Warehouse Architectures:Conceptual View Single-layer –Every data element is stored once only –Virtual warehouse
  • 8.
    8 Two-layer  Real-time +derived data –Most commonly used approach in industry today
  • 9.
    9 Three-layer Architecture: ConceptualView  Transformation of real-time data to derived data really requires two steps
  • 10.
    10 The architecture consistsof various interconnected elements:  Operational and external database layer – the source data for the DW  Information access layer – the tools the end user access to extract and analyze the data  Data access layer – the interface between the operational and information access layers  Metadata layer – the data directory or repository of metadata information
  • 11.
    11 Additional layers are: Process management layer – the scheduler or job controller  Application messaging layer – the “middleware” that transports information  around the firm  Physical data warehouse layer – where the actual data used in the DSS are located  Data staging layer – all of the processes necessary to select, edit, summarize and load warehouse data from the operational and external data bases
  • 12.
    12 Components of theData Warehouse Architecture
  • 13.
  • 14.
    14 Three Data WarehouseModels  Enterprise Warehouse- Collects all of the information about subjects spanning the entire organization.  The Central data warehouse- A single physical database contains all of the data for a specific functional area.  The distributed data warehouse- The components are distributed across several physical databases.
  • 15.
    15 Extraction, Transformation andLoading (ETL)  Data extraction - get data from multiple, heterogeneous, and external sources  Data cleaning –detect errors in the data and rectify them when possible  Data transformation –convert data from legacy or host format to warehouse format  Loading –sort, summarize, consolidate, compute views, check integrity, and build indices and partitions  Refresh –propagate the updates from the data sources to the warehouse
  • 16.
    16 Design of DataWarehouse: A Business Analysis Framework Four views regarding the design of a data warehouse  Top-down view It allows selection of the relevant information necessary for the data warehouse  Data source view It exposes the information being captured, stored, and managed by operational systems  Data warehouse view It consists of fact tables and dimension tables  Business query view It shows the perspectives of data in the warehouse from the view of end-user.
  • 17.
    17 Multidimensional Data Model FromTables and Spreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model which views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables[Eg. item (item_name, brand, type), or time(day, week, month, quarter, year) ] Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
  • 18.
  • 19.
    19 Data Cube  n-dimensionalcube C[A1,A2,…An] is a database with n dimensions as A1,A2…An each of which represents a theme and contains |Ai| number of distinct elements in the dimension Ai.  Each distinct element of Ai corresponds to a data row of C.  A data cell in the cube C[a1,a2,..an] stores the numeric measures of the data for Ai=ai, for all i.  Eg. Data cube C[place, product, year]  Data Cell [Peshawar, TV, 4Qtr] stores 314 as numeric measure which represents three dimensions with 3,1 and 4rows respectively.
  • 20.
    20  A SampleData Cube
  • 21.
  • 22.
    22 Typical Data cubeOLAP Operations  Roll up (drill-up): Summarize data by climbing up hierarchy or by dimension reduction  Drill down(roll down): Reverse of roll-up  From higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and Dice: Project and select  Pivot (rotate): reorient the cube visualization, 3D to series of 2D planes  Drill across: involving (across) more than one fact table  Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
  • 23.
  • 24.
    24 Conceptual Modeling ofData Warehouses Dimension Schema  Modeling data warehouses: dimensions & measures  Starschema: A fact table in the middle connected to a set of dimension tables  Snowflakeschema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Factconstellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  • 25.
    25 Star Schema  Easyfor users to understand  Fast response to queries  Simple metadata  Supported by many front end tools  Less robust to change  Slower to build  Does not support history
  • 26.
  • 27.
    27 Snowflake Schema  Directuse by some tools  More flexible to change  Provides for speedier data loading  May become large and unmanageable  Degrades query performance  More complex metadata
  • 28.
  • 29.
    29 Fact Constellation  Measureof online analytical processing  Collection of multiple fact tables  Sharing dimension tables  Viewed as a collection of stars.  Improvement over Star schema.
  • 30.