2. Characteristics of a Data Warehouse
A data warehouse is a database designed for
querying, reporting, and analysis.
A data warehouse contains historical data
derived from transaction data.
Data warehouses separate analysis workload
from transaction workload.
A data warehouse is primarily
an analytical tool.
3. Comparing OLTP and Data Warehouses
                              OLTP                  Data Warehouse
Joins                         Many                  Some
Data accessed by queries      Comparatively lower   Large amount
Duplicated data               Normalized DBMS       Denormalized DBMS
Derived data and aggregates   Rare                  Common
4. Data Warehouse Architectures
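[Figure: data warehouse architecture. Labels: operational systems and flat files (data sources); staging area; warehouse holding metadata, raw data, and materialized views; Sales, Purchasing, and Inventory; analysis, reporting, and data mining (users).]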
5. Data Warehouse Design
• Key data warehouse design considerations:
– Identify the specific data content.
– Recognize the critical relationships within and
between groups of data.
– Define the system environment
supporting your data warehouse.
– Identify the required data
transformations.
– Calculate the frequency at which
the data must be refreshed.
6. Logical Design
– A logical design is conceptual and
abstract.
– Entity-relationship (ER) modeling
is useful in identifying logical
information requirements.
• An entity represents a chunk of data.
• The properties of entities are known as attributes.
• The links between entities are known as
relationships.
– Dimensional modeling is a specialized
type of ER modeling useful in data warehouse
design.
7. Oracle Warehouse Builder
– Oracle Database provides tools to implement
the ETL process.
• Oracle Warehouse Builder is a tool to help in this
process.
– Oracle Warehouse Builder generates the
following types of code:
• SQL data definition language (DDL) scripts
• PL/SQL programs
• SQL*Loader control files
• XML Processing Description Language (XPDL)
• ABAP code (used to extract data from SAP systems)
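The ETL process mentioned above can be pictured with a minimal sketch. It is not Oracle Warehouse Builder output and does not use any Oracle API; it uses Python's standard csv and sqlite3 modules, and the sample data, the orders table, and the function names extract, transform, and load are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract: one row has a missing amount,
# and the currency codes are inconsistently cased.
RAW = "order_id,amount,currency\n1, 10.50 ,usd\n2,,usd\n3,7.25,USD\n"

def extract(text):
    """Extract: parse the flat-file text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, fix types and casing."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # reject incomplete records
        clean.append(
            (int(row["order_id"]), float(amount), row["currency"].strip().upper())
        )
    return clean

def load(rows):
    """Load: insert the cleaned rows into an in-memory target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn
```

Calling load(transform(extract(RAW))) leaves two rows in the orders table, because the record with the empty amount is rejected in the transform step.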
8. Data Warehousing Schemas
– Objects can be arranged in data warehousing
schema models in a variety of ways:
• Star schema
• Snowflake schema
• Third normal form (3NF) schema
• Hybrid schemas
– The source data model and user
requirements should steer the data
warehouse schema.
– Implementation of the logical model may
require changes to enable you to adapt it to
your physical system.
9. Schema Characteristics
– Star schema
• Characterized by one or more large fact tables and a
number of much smaller dimension tables
• Each dimension table joined to the fact table using a
primary key to foreign key join
– Snowflake schema
• Dimension data grouped into multiple tables instead
of one large table
• Increased number of dimension tables, requiring
more foreign key joins
– Third normal form (3NF) schema
• A classical relational-database model that minimizes
data redundancy through normalization
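As a rough illustration of the star schema described above, the following sketch builds one fact table and two dimension tables in an in-memory SQLite database (not Oracle) and runs a typical star join. The table names (sales, products, customers), the column names, and the sample rows are assumptions made for the example.

```python
import sqlite3

def build_star_schema():
    """Create a tiny star schema: one fact table, two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products  (prod_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, country TEXT);
        -- Fact table: foreign keys to each dimension plus a measure.
        CREATE TABLE sales (
            prod_id INTEGER REFERENCES products(prod_id),
            cust_id INTEGER REFERENCES customers(cust_id),
            amount  REAL
        );
        INSERT INTO products  VALUES (1, 'Books'), (2, 'Music');
        INSERT INTO customers VALUES (10, 'US'), (20, 'DE');
        INSERT INTO sales VALUES (1, 10, 15.0), (1, 20, 5.0), (2, 10, 30.0);
    """)
    return conn

def sales_by_category_and_country(conn):
    """Join the fact table to each dimension on its key and aggregate."""
    return conn.execute("""
        SELECT p.category, c.country, SUM(s.amount)
        FROM sales s
        JOIN products  p ON p.prod_id = s.prod_id
        JOIN customers c ON c.cust_id = s.cust_id
        GROUP BY p.category, c.country
        ORDER BY p.category, c.country
    """).fetchall()
```

A snowflake variant would split a dimension such as products into further tables (for example, a separate category table), which adds one more join per split.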
10. Data Warehousing Objects
– Fact tables
• Fact tables are the large tables that store business
measurements.
– Dimension tables
• A dimension is a structure composed of one or more
hierarchies that categorizes data.
• A unique identifier is specified for each distinct
record in a dimension table.
– Relationships
• Relationships guarantee
integrity of business
information.
11. Fact Tables
– A fact table must be defined for each star schema.
– Fact tables are the large tables that store business
measurements.
– A fact table contains either detail-level or
aggregated facts.
– A fact table usually contains facts with the same
level of aggregation.
– The primary key of the fact table is
usually a composite key made up
of all its foreign keys.
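The composite-key point can be sketched as follows, again using SQLite rather than Oracle. The sales, times, and products tables and the helper try_insert are hypothetical names chosen for the example.

```python
import sqlite3

def make_fact_table():
    """Fact table whose primary key is made up of all its foreign keys."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE times    (time_id INTEGER PRIMARY KEY);
        CREATE TABLE products (prod_id INTEGER PRIMARY KEY);
        -- The REFERENCES clauses document the links to the dimensions;
        -- SQLite enforces them only when PRAGMA foreign_keys is ON.
        CREATE TABLE sales (
            time_id  INTEGER NOT NULL REFERENCES times(time_id),
            prod_id  INTEGER NOT NULL REFERENCES products(prod_id),
            quantity INTEGER,
            PRIMARY KEY (time_id, prod_id)
        );
        INSERT INTO times VALUES (1), (2);
        INSERT INTO products VALUES (100);
    """)
    return conn

def try_insert(conn, time_id, prod_id, quantity):
    """Return True if the fact row was stored, False if the key already exists."""
    try:
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?)", (time_id, prod_id, quantity)
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this key, a second row for the same time and product is rejected, so each combination of dimension values identifies at most one fact row.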
12. Dimensions and Hierarchies
– A dimension is a structure composed of one or more
hierarchies that categorizes data.
– Dimensional attributes help to
describe the dimensional value.
– Dimension data is collected at the
lowest level of detail and aggregated
into higher-level totals.
– Hierarchies are structures that use
ordered levels to organize data.
– In a hierarchy, each level is
connected to the levels above and
below it.
[Figure: CUSTOMERS dimension hierarchy (by level): REGION, SUBREGION, COUNTRY, STATE, CITY, CUSTOMER]
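To make the idea of aggregating detail data into higher-level totals concrete, here is a minimal sketch in plain Python. The level names follow the CUSTOMERS hierarchy on this slide; the sample rows, the amount measure, and the function roll_up are hypothetical.

```python
from collections import defaultdict

# Ordered levels of the hierarchy, from highest to lowest.
LEVELS = ["region", "subregion", "country", "state", "city", "customer"]

# Hypothetical detail rows collected at the lowest level (customer).
SALES = [
    {"region": "Americas", "subregion": "Northern America", "country": "US",
     "state": "CA", "city": "San Francisco", "customer": "C1", "amount": 100},
    {"region": "Americas", "subregion": "Northern America", "country": "US",
     "state": "CA", "city": "Los Angeles", "customer": "C2", "amount": 50},
    {"region": "Americas", "subregion": "Northern America", "country": "US",
     "state": "TX", "city": "Austin", "customer": "C3", "amount": 25},
    {"region": "Europe", "subregion": "Western Europe", "country": "DE",
     "state": "BE", "city": "Berlin", "customer": "C4", "amount": 40},
]

def roll_up(rows, level):
    """Sum amounts for every distinct path from the top level down to `level`."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[name] for name in LEVELS[:depth])
        totals[key] += row["amount"]
    return dict(totals)
```

Because each level is connected to the levels above it, the key for a state includes its country, subregion, and region, so rolling up to "state" and then to "region" gives consistent totals.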
15. Data Warehouse Physical Structures
• Tables and partitioned tables
– Partitioned tables enable you to split
large data volumes into smaller,
more manageable pieces.
– Expect performance benefits from:
• Partition pruning
• Intelligent parallel processing
– Compressed tables offer scaleup opportunities
for read-only operations.
– Table compression saves disk space.
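Partition pruning can be pictured with a small in-memory model. This is a sketch of the idea only, not how Oracle implements it; the class PartitionedTable, its methods, and the choice of (year, month) as the partition key are assumptions.

```python
from collections import defaultdict

class PartitionedTable:
    """Rows are stored in separate partitions keyed by (year, month)."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, year, month, amount):
        self.partitions[(year, month)].append((year, month, amount))

    def total_pruned(self, year, month):
        """Read only the one partition that can contain matching rows."""
        rows = self.partitions.get((year, month), [])
        return sum(amount for _, _, amount in rows), len(rows)

    def total_full_scan(self, year, month):
        """Read every partition and filter row by row."""
        total = 0
        scanned = 0
        for rows in self.partitions.values():
            for y, m, amount in rows:
                scanned += 1
                if (y, m) == (year, month):
                    total += amount
        return total, scanned
```

Both methods return the same total together with the number of rows they had to read; the pruned version reads only the rows in the matching partition.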
16. Data Warehouse Physical Structures
– Views:
• Are tailored presentations of data contained in one
or more tables or views
• Do not require any space in the database
– Materialized views:
• Are query results that have been stored in advance
• (Like indexes) are used transparently and improve
performance
– Integrity constraints:
• Are used in data warehouses for query rewrite
– Dimensions:
• Are containers of logical relationships and do not
require any space in the database
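The difference between a view and a materialized view can be sketched as follows. SQLite has no materialized views, so this example emulates one with an ordinary summary table that is refreshed by hand; it does not show Oracle's transparent query rewrite. All object names (sales, sales_by_region_v, sales_by_region_mv) and the helpers are hypothetical.

```python
import sqlite3

SUMMARY_QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

def setup():
    """Create a base table, a view over it, and a stored summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        CREATE TABLE sales (region TEXT, amount INTEGER);
        INSERT INTO sales VALUES ('EAST', 10), ('EAST', 20), ('WEST', 5);
        -- View: only the query text is stored; it runs on every access.
        CREATE VIEW sales_by_region_v AS {SUMMARY_QUERY};
        -- Emulated materialized view: the query result itself is stored.
        CREATE TABLE sales_by_region_mv AS {SUMMARY_QUERY};
    """)
    return conn

def refresh_mv(conn):
    """Recompute the stored result from the base table."""
    conn.executescript(f"""
        DELETE FROM sales_by_region_mv;
        INSERT INTO sales_by_region_mv {SUMMARY_QUERY};
    """)

def totals(conn, name):
    """Read region totals from the named view or summary table."""
    return dict(conn.execute(f"SELECT region, total FROM {name}").fetchall())
```

After a new row is inserted into sales, the view reflects it immediately, while the stored summary stays unchanged until refresh_mv is called. That is the trade-off: the stored result occupies space and needs refreshing, but answering from it avoids re-aggregating the base table.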
17. Managing Large Volumes of Data
• Work smarter in your data warehouse:
– Partitioning
– Bitmap indexes/Star transformation
– Data compression
– Query rewrite
• Work harder in your data warehouse:
– Parallelism for all operations
• DBA tasks, such as loading, index creation, table
creation, data modification, backup and recovery
• End-user operations, such as queries
• Unbounded scalability: Real Application Clusters
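The bitmap indexes listed above can be illustrated with a short sketch that uses Python integers as bitmaps. It shows the general idea (one bitmap per distinct value, combined with bitwise AND), not Oracle's storage format; the function names and sample rows are hypothetical.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int bitmap.

    Bit i of the bitmap is set when row i has that value.
    """
    index = {}
    for i, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << i)
    return index

def matching_rows(bitmap):
    """Return the row positions whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical dimension-like rows with low-cardinality columns.
ROWS = [
    {"gender": "F", "region": "EAST"},
    {"gender": "M", "region": "EAST"},
    {"gender": "F", "region": "WEST"},
    {"gender": "F", "region": "EAST"},
]

gender_index = build_bitmap_index(ROWS, "gender")
region_index = build_bitmap_index(ROWS, "region")

# WHERE gender = 'F' AND region = 'EAST' becomes a single bitwise AND.
both = gender_index["F"] & region_index["EAST"]
```

Combining predicates with cheap bitwise operations before touching any table rows is what makes this kind of index attractive for the multi-condition queries common in a warehouse.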
18. I/O Performance in Data Warehouses
– I/O is typically the primary determinant of data
warehouse performance.
– Data warehouse storage configurations should be
chosen by I/O bandwidth, not storage capacity.
– Every component of the I/O
subsystem should provide
enough bandwidth:
• Disks
• I/O channels
• I/O adapters
– In data warehouses, maximizing
sequential I/O throughput is critical.
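One way to reason about "enough bandwidth" at every component is a simple bottleneck model: sustained throughput is capped by the stage with the lowest aggregate bandwidth. The sketch below assumes that simplified model; the function name and all figures in the usage note are made up for illustration and are not vendor numbers.

```python
def io_throughput_mb_s(disks, disk_mb_s,
                       channels, channel_mb_s,
                       adapters, adapter_mb_s):
    """Return (throughput, bottleneck) under a simple bottleneck model.

    Each stage's aggregate bandwidth is count * per-unit bandwidth;
    the slowest stage limits the whole I/O path.
    """
    stages = {
        "disks": disks * disk_mb_s,
        "channels": channels * channel_mb_s,
        "adapters": adapters * adapter_mb_s,
    }
    bottleneck = min(stages, key=stages.get)
    return stages[bottleneck], bottleneck
```

For example, with 20 disks at 100 MB/s, 2 channels at 400 MB/s, and 2 adapters at 800 MB/s, the call io_throughput_mb_s(20, 100, 2, 400, 2, 800) reports the channels as the limiting stage at 800 MB/s, even though the disks alone could deliver 2000 MB/s.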
19. I/O Scalability
Parallel execution:
– Reduces response time for data-intensive operations on large
databases
– Benefits systems with the following characteristics:
• Multiprocessors, clusters, or massively parallel systems
• Sufficient I/O bandwidth
• Sufficient memory to support memory-intensive processes such
as sorts, hashing, and I/O buffers
[Figure: a coordinator dispatches work to query servers Q1 to Q4; scanners read data on disk and feed sorters (aggregators).]
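The coordinator and query-server pattern can be sketched in a few lines of Python. This only shows the shape of the technique (split the data, sort the pieces concurrently, merge in the coordinator); it is not Oracle's parallel execution engine, and because of CPython's global interpreter lock the threads here do not give real CPU parallelism. The names scan_and_sort and parallel_sort and the degree parameter are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def scan_and_sort(chunk):
    """Work done by one query server: read its chunk and sort it."""
    return sorted(chunk)

def parallel_sort(data, degree=4):
    """Coordinator: dispatch chunks to workers, then merge sorted results."""
    size = max(1, -(-len(data) // degree))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=degree) as pool:
        sorted_chunks = list(pool.map(scan_and_sort, chunks))
    # Merging already-sorted chunks is much cheaper than a full re-sort.
    return list(heapq.merge(*sorted_chunks))
```

The degree argument plays the role of the number of query servers; response time only improves if the machine has the processors, I/O bandwidth, and memory listed above to keep those workers busy.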
20. I/O Scalability
• Automatic Storage Management (ASM)
– Configuring storage for a DB depends on many
variables:
• Which data to put on which disk
• Logical unit number (LUN) configurations
• DB types and workloads; data warehouse, OLTP, DSS
• Trade-offs between available options
– ASM provides solutions to storage issues
encountered in data warehouses.
21. I/O Scalability
• Automatic Storage Management: Overview
– Portable and high-performance
cluster file system
– Manages Oracle database files
– Data spread across disks
to balance load
– Integrated mirroring across
disks
– Solves many storage
management challenges
[Figure: software stack with the layers Application, Database, File system, Volume manager, ASM, Operating system]