Shikha Gautam
Asst.Professor
 The Data warehouse is an environment, not a product.
 It is an architectural construct of an information system that
provides users with current and historical decision support
information that is hard to access or present in traditional
operational data store.
 Data warehousing is a blend of technologies and components
aimed at effective integration of operation database into an
environment that enables strategic use of data.
 These technologies include relational and multi-dimensional
database management system, client/ server architecture,
meta-data modeling and repositories, graphical user interface
etc.
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
 Data warehousing is the process of constructing and using
data warehouses.
 Focusing on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing.
 Organized around major subjects, such as customer, product,
sales.
 Provide a simple and concise view around particular subject.
 The time horizon for the data warehouse is significantly
longer than that of operational systems:
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse contains an
element of time, explicitly or implicitly.
 A physically separate store of data transformed from the
operational environment.
 Does not require transaction processing, recovery, and
concurrency control mechanisms
 Requires only two operations in data accessing:
▪ initial loading of data and access of data
DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery.
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date
detailed, flat relational
isolated
historical,
summarized, multidimensional
integrated, consolidated
usage repetitive ad-hoc
access read/write
index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
Data Warehouse: A Multi-Tiered Architecture
Data
Warehouse
Extract
Transform
Load
Refresh
OLAP Engine
Analysis
Query
Reports
Data mining
Monitor
&
Integrator
Metadata
Data Sources Front-End Tools
Serve
Data Marts
Operational
DBs
Other
sources
Data Storage
OLAP Server
 A data warehouse is based on a multidimensional data model
which views data in the form of a data cube.
 A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type),
or time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and
keys to each of the related dimension tables
time,item
time,item,location
time, item, location, supplier
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
 An n-D base cube is called a base cuboid.
 The top most 0-D cuboid, which holds the
highest-level of summarization, is called the
apex cuboid.
 The lattice of cuboids forms a data cube.
 Data warehouse have different schemas as:
1. Star schema: A fact table in the middle connected
to a set of dimension tables.
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
state_or_province
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
2. Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake.
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
item
branch_key
branch_name
branch_type
branch
supplier_key
supplier_type
supplier
city_key
city
state_or_province
country
city
3. Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema.
31
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_state
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_key
shipper_name
location_key
shipper_type
shipper
 Distributive: if the result derived by applying the function to n
aggregate values is the same as that derived by applying the
function on all the data without partitioning.
▪ E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a bounded integer).
 E.g., avg(), min N(), standard_deviation()
 Holistic: if there is no constant bound on the storage size
needed to describe a sub-aggregate.
▪ E.g., median(), mode(), rank()
There are following seven components of a Data
Warehouse:
 Data Warehouse Database
 Sourcing, Acquisition, Cleanup andTransformation
Tools
 Meta Data
 Access (Query)Tools :The query tool allows executives
and other users real-time access to the DataWarehouse
database for query generation, result displays, reports
and data exports.
 Data Marts
 Data Warehouse Administration and Management
 Information Delivery System
1. Business Considerations- Identifying and establishing
information requirements.
2. Design Considerations- Data Content, Metadata, Data
Distribution, Tools, Performance consideration.
3. Technical Considerations- Hardware Platform, DBMS that
supports the warehouse, Communication infrastructure,
hardware platform and software to support the metadata
repository, Systems management framework.
4. Implementation Considerations- Access Tools, Data
Extraction, Cleanup, Transformation, and Migration etc.
The size of a data warehouse rapidly approaches to the
search for better performance. This search aims to pursue
two goals:
 Speed-up: the ability to execute the same request on
the same amount data in less time.
 Scale-up: the ability to obtain the same performance
on the same request as the database size increases.
Doubling the number of processors cuts the response
time in half (linear speed-up) or provides the same
performance on twice as much data (linear scale-up).
 The goals of linear performance and scalability can be
satisfied by parallel hardware architectures, parallel
operating systems, and parallel DBMSs.
 Parallel hardware architectures are based on Multi-
processor systems designed as a Shared-memory model
(symmetric multiprocessors), Shared-disk model or
distributed-memory model (MPP and Clusters of SMPs).
Parallelism can be achieved in two different ways:
1. Horizontal Parallelism (Database is partitioned across
different disks)
2. Vertical Parallelism (occurs among different tasks – all
components query operations i.e. scans, join, sort)
3. Data Partitioning.
 Shared memory -- processors share a
common memory.
 Shared disk -- processors share a common
disk.
 Shared nothing -- processors share neither a
common memory nor common disk.
 Hierarchical -- hybrid of the above
architectures.Also known as Combined
Architecture .
 ORACLE – Oracle supports Parallel Database processing with
its add-on.
 Informix –Support Shared-Memory, Shared-Disk, and
Shared-Nothing Models.
 IBM – DB2 Parallel Edition (DB2 PE), DB2 Universal
Database.
 Sybase – Sybase implemented its parallel DBMS functionality in
a product called SYBASE MPP.
 Other RDBMS Products : i. NCR Teradata ii. Tandem NonStop
SQL/MP.

Data Warehousing

  • 1.
  • 2.
     The Datawarehouse is an environment, not a product.  It is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational data store.  Data warehousing is a blend of technologies and components aimed at effective integration of operation database into an environment that enables strategic use of data.  These technologies include relational and multi-dimensional database management system, client/ server architecture, meta-data modeling and repositories, graphical user interface etc.
  • 3.
    “A data warehouseis a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon  Data warehousing is the process of constructing and using data warehouses.  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
  • 4.
     Organized aroundmajor subjects, such as customer, product, sales.  Provide a simple and concise view around particular subject.
  • 5.
     The timehorizon for the data warehouse is significantly longer than that of operational systems:  Operational database: current value data  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
  • 6.
     A physicallyseparate store of data transformed from the operational environment.  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing: ▪ initial loading of data and access of data
  • 7.
    DBMS— tuned forOLTP: access methods, indexing, concurrency control, recovery. Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
  • 8.
    OLTP OLAP users clerk,IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response
  • 9.
    Data Warehouse: AMulti-Tiered Architecture Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Monitor & Integrator Metadata Data Sources Front-End Tools Serve Data Marts Operational DBs Other sources Data Storage OLAP Server
  • 23.
     A datawarehouse is based on a multidimensional data model which views data in the form of a data cube.  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
  • 24.
    time,item time,item,location time, item, location,supplier all time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid
  • 25.
     An n-Dbase cube is called a base cuboid.  The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid.  The lattice of cuboids forms a data cube.
  • 26.
     Data warehousehave different schemas as: 1. Star schema: A fact table in the middle connected to a set of dimension tables.
  • 27.
  • 28.
    2. Snowflake schema:A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake.
  • 29.
  • 30.
    3. Fact constellations:Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema.
  • 31.
  • 32.
     Distributive: ifthe result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. ▪ E.g., count(), sum(), min(), max()  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer).  E.g., avg(), min N(), standard_deviation()  Holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate. ▪ E.g., median(), mode(), rank()
  • 33.
    There are followingseven components of a Data Warehouse:  Data Warehouse Database  Sourcing, Acquisition, Cleanup andTransformation Tools  Meta Data  Access (Query)Tools :The query tool allows executives and other users real-time access to the DataWarehouse database for query generation, result displays, reports and data exports.  Data Marts  Data Warehouse Administration and Management  Information Delivery System
  • 34.
    1. Business Considerations-Identifying and establishing information requirements. 2. Design Considerations- Data Content, Metadata, Data Distribution, Tools, Performance consideration. 3. Technical Considerations- Hardware Platform, DBMS that supports the warehouse, Communication infrastructure, hardware platform and software to support the metadata repository, Systems management framework. 4. Implementation Considerations- Access Tools, Data Extraction, Cleanup, Transformation, and Migration etc.
  • 35.
    The size ofa data warehouse rapidly approaches to the search for better performance. This search aims to pursue two goals:  Speed-up: the ability to execute the same request on the same amount data in less time.  Scale-up: the ability to obtain the same performance on the same request as the database size increases. Doubling the number of processors cuts the response time in half (linear speed-up) or provides the same performance on twice as much data (linear scale-up).
  • 36.
     The goalsof linear performance and scalability can be satisfied by parallel hardware architectures, parallel operating systems, and parallel DBMSs.  Parallel hardware architectures are based on Multi- processor systems designed as a Shared-memory model (symmetric multiprocessors), Shared-disk model or distributed-memory model (MPP and Clusters of SMPs). Parallelism can be achieved in two different ways: 1. Horizontal Parallelism (Database is partitioned across different disks) 2. Vertical Parallelism (occurs among different tasks – all components query operations i.e. scans, join, sort) 3. Data Partitioning.
  • 37.
     Shared memory-- processors share a common memory.  Shared disk -- processors share a common disk.  Shared nothing -- processors share neither a common memory nor common disk.  Hierarchical -- hybrid of the above architectures.Also known as Combined Architecture .
  • 38.
     ORACLE –Oracle supports Parallel Database processing with its add-on.  Informix –Support Shared-Memory, Shared-Disk, and Shared-Nothing Models.  IBM – DB2 Parallel Edition (DB2 PE), DB2 Universal Database.  Sybase – Sybase implemented its parallel DBMS functionality in a product called SYBASE MPP.  Other RDBMS Products : i. NCR Teradata ii. Tandem NonStop SQL/MP.