Motivation: Why data warehouse?
What is a data warehouse?
Why separate DW?
Conceptual modeling of DW
Data Warehousing Architectures
Data Warehousing and OLAP Data Warehouse Development
Data Warehouse Vendors
Yudho Giri Sucahyo, Ph.D, CISA (email@example.com) Real-time DW
Faculty of Computer Science, University of Indonesia
Motivation: Why data warehouse? What is a data warehouse? [JH]
Construction of data warehouses (DW) involves data Defined in many different ways, but not rigorously.
cleaning and data integration important A decision support database that is maintained separately
preprocessing step for data mining (DM). from the organization’s ODB.
DW provide OLAP for the interactive analysis of Support information processing by providing a solid platform
of consolidated, historical data for analysis.
multidimensional data, which facilitates effective DM.
“A data warehouse is a subject-oriented, integrated,
Data mining functions can be integrated with OLAP
time-variant, and nonvolatile collection of data in
operations to enhance interactive mining of knowledge.
support of management’s decision-making process.” —
DW will provide an effective platform for DM. W. H. Inmon
Whil DWs are not requirements to do DM, DW store
t i t t d DM t Case Study 2: Continental Airlines flies high with its
massive amounts of data that can be uses for DM. [DO] real-time data warehouse
What is a data warehouse? [ET] Subject Oriented
Organized around major subjects, such as
A physical repository where relational data are specially
customer, product, sales.
organized to provide enterprise-wide, cleansed data in a
standardized format. Provide i l
P id a simple and concise view around
d i i d
Characteristics particular subject issues by excluding data that
Subject oriented, Integrated, Time Variant, Non-volatile are not useful in the decision support process.
Web-based, Relational/multidimensional, Client/server, Real-time
Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
Process of constructing and using data warehouses. transaction processing
Requires data integration, data cleaning, and data consolidation.
Integrated Time Variant
Integrate multiple, heterogeneous data sources The time horizon for the data warehouse is significantly
Relational databases, flat-files, on-line transaction records longer than that of operational systems.
Data cleaning and data integration techniques are
g g q Operational database: current value data.
applied Data warehouse data: provide information from a historical
Ensure consistency in naming conventions, encoding perspective (e g past 5-10 years)
structures, attribute measures, etc. among different data
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
E.g., Hotel price: currency, tax, breakfast covered, etc.
But the key of operational data may or may not contain “time
When d i
Wh data is moved to the warehouse, it is converted.
d h h i i d element”.
Non volatile Data Warehouse vs Heterogeneous DBMS
A physically separate store of data transformed from the
p y y p Traditional heterogeneous DB integration:
Build wrappers/mediators on top of multiple, heterogeneous databases.
operational environment. Ex: IBM Data Joiner, Informix DataBlade
Operational update of data does not occur i the d
O i l d fd d in h data Query d i
Q driven approach:
When a query is posed to a client site, a metadata-dictionary is used
to translate the query into queries appropriate for the individual
Does not require transaction processing, recovery, and heterogeneous sites involved. There queries are then mapped and sent
to local query processors. The results returned from the different
concurrency control mechanisms
sites are integrated into a global answer set.
d l b l
Requires only two operations in data accessing: Complex information filtering and integration processes, compete for
iinitiall lloading of data and access of data.
ii di fd d fd resources.
Inefficient and potentially expensive for frequent queries, especially for
queries requireing aggregations.
q g gg g
Data Warehouse vs Heterogeneous DBMS (2)
vs. DW vs ODB
Using DW update-driven approach Major task of ODB OLTP:
Information from multiple, heterogeneous sources is integrated in advance Day-to-day operations: purchasing, inventory, banking,
and stored in a warehouse for direct querying and analysis. manufacturing, payroll, registration, accounting, etc.
Unlike OLTP, DW do not contain the most current information
OLTP information. DW serve f data analysis and decision making
for d l i dd i i ki OLAP
DW brings high performance to the integrated heterogeneous Distinct Features (OLTP vs. OLAP)
DB system since data are copied, preprocessed, integrated,
copied preprocessed integrated User and system orientation: customer vs. market
U d i i k
annotated, summarized, and restructured into one data store. Data contents: current, detailed vs. historical, consolidated
Query processing in DW does not interfere with the processing Database design: ER + application vs. star + subject
at local sources View: current, local vs. evolutionary, integrated
DW can store and integrate historical information and support
g pp Access patterns: update vs. read-only but complex queries
complex multidimensional queries.
OLTP vs OLAP Why Separate DW?
Clerk IT professional Knowledge worker High performance for both systems:
g p y
function day to day operations decision support DBMS — tuned for OLTP: access methods, indexing,
DB design application-oriented subject-oriented concurrency control, recovery
data current, up-to-date historical, Warehouse — tuned for OLAP: complex OLAP queries,
detailed, flat relational summarized, multidimensional computation of large groups of data at summarized levels,
isolated integrated, consolidated multidimensional view, consolidation.
usage repetitive ad-hoc
access read/write lots of scans Processing OLAP queries in operational databases would
index/hash on prim. key degrade the performance of operational tasks.
unit of work short, simple transaction complex query
In ODB, concurrency control and recovery mechanisms
# records accessed tens millions
(locking, logging) are required to ensure the consistency
#users thousands hundreds
and robustness of transactions.
d b f i
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response OLAP read only access. No need for concurrency
control and recovery
Why Separate DW? (2) Conceptual Modeling of DW
Different functions and different data: Data Cube:
missing data: Decision support requires historical data which
operational DBs do not typically maintain. So, data in ODB is see TSBD Lecture Notes on Visualization of Data Cubes
usually far from complete for decision making.
y p g
Modeling d t
M d li data warehouses: dimensions & measurements
h di i t
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources. ODB Star schema: A single object (fact table) in the middle connected
contain detailed raw data (transactions) which need to be
t i d t il d d t (t ti ) hi h dt b to a number of objects (dimension tables one for each
consolidated before analysis. dimension).
data quality: different sources typically use inconsistent data
q y yp y Snowflake schema: A refinement of star schema where the
representations, codes and formats which have to be dimensional hierarchy is represented explicitly by normalizing
reconciled. the dimension tables.
Fact constellations: Multiple fact tables share dimension tables.
Also known as galaxy schema
Example of Star Schema Snowflake Schema
Date Year Product
Day ProductNo Year Month
Sales Fact Table ProdName Date ProductNo
Month Month Sales Fact Table
Year Date Year Day
QOH Month Category
Store Product QOH
StoreID Cust Store Store
City Customer Cust
CustId City StoreID Customer
C tN City
Country CustCity unit_sales CustName
Region dollar_sales CustCountry CustCity
State dollar_sales CustCountry
Potensi Redundansi Country Country
Bandung, Bogor keduanya Country
ada di Jawa Barat Measurements
View of Warehouses and Hierarchies Data Cube
D t Total annual sales
2Qtr of TV in U.S.A.
1Qtr 3Qtr 4Qtr sum
Importing data sum
Dimension creation Mexico
Data Cube Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data or
introducing new dimensions
Slice and dice:
project and select
reorient the cube, visualization, 3D to series of 2D planes.
d ill across: iinvolving (
l i (across) more than one fact table.
) th f t t bl
drill through: through the bottom level to its back-end relational tables.
p More info:
Interactive manipulation www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org
Data Mart Data Mart
DW collects information about subjects that span the A data mart can be either dependent or independent.
entire organization, such as customers, products, sales, assets, A dependent data mart is a subset that is created directly
and personnel. Its scope is enterprise-wide. from the DW.
For DW, fact constellation schema is commonly used Consistent data model
since it can model multiple, interrelated subjects. Providing quality data
Data Mart is a subset of a DW, focuses on a particular DW must be constructed first
subject. Its scope is department-wide. Typically, a data mart Ensures that the user viewing the same version of the data that
consisting of a single subject area (e.g. marketing,
f l b ( k are accessed by all other d warehouse users
d b ll h data h
operations). An independent data mart is a small warehouse designed
For Data Mart, star or snowflake schema are commonly for department, and i source is not an EDW.
f ad d its i EDW
used since both are geared towards modeling single
subjects, although th star schema i more popular.
bj t lth h the t h is l
Data Warehousing Process Overview Data Warehousing Process Overview
The major components of a data warehousing process
Legacy systems, external data providers (e.g. BPS), OLTP,
Data Warehousing Architectures Data Warehousing Architectures
Data Warehousing Architectures Data Warehousing Architectures
Data Integration and the ETL Process Data Integration and the ETL Process
Various integration technologies: ETL
Enterprise Application Integration (EAI) 60-70% of the time in a data-centric project.
A technology that provides a vehicle for pushing data from source Extraction: Reading data from one or more databases
systems i t a data warehouse
t into d t h Transformation
Integrating application functionality and is focused on sharing Converting the extracted data from its previous form into the form in
functionality across systems which it needs to be so that it can be placed into a DW
Traditionally, API. Nowadays, SOA (web services). Load
Enterprise Information Integration (EII) Putting the
An evolving tool space that promises real-time data integration from data
a variety of sources, such as relational databases, Web services, and the DW
A mechanism for pulling data from source systems to satisfy a request
Data Warehouse Development Data Warehouse Development
Direct benefits Some best practices for implementing a DW (Weir, 2002):
Allowing end users to perform extensive analysis in numerous Project must fit with corporate strategy and business objectives
ways There must be complete buy-in to the project by executives,
A consolidated view of corporate data (i.e a single version of
f ( f managers,
managers and users
the truth) It is important to manage user expectations about the completed
Better and more timely information
The data warehouse must be built incrementally
Enhanced system performance. DW frees production Build in adaptability
processing because some operational system reporting
Managed b b h IT and b i
M d by both d business professionals
f i l
requirements are moved to DSS
Develop a business/supplier relationship
Simplification of data access
Only load data th t h
O l l d d t that have been cleansed and are of a quality
b l d d f lit
understood by the organization
Do not overlook training requirements
Be politically aware
Data Warehouse Vendors Data Warehouse Vendors
Computer Associates Microsoft Six guidelines to considered when developing a
g p g
DataMirror Oracle vendor list:
Data Advantage Group
g p SAS 1.
1 Financial strength
Dell Computer Siemens 2. ERP linkages
Embarcadero Technologies Sybase 3. Qualified
Q lifi d consultants
Business Objects Teradata
4. Market share
HP Please visit:
5. Industry experience
Hummingbird Data Warehousing Institute
(tdwi com) 6. Established partnerships
DM Review (dmreview.com)
Real time DW Real-time
Real time DW
Traditionally, updated on a weekly basis.
Unsuitable for some businesses.
Real-time (active) data warehousing
( ) g
The process of loading and providing data via a data
warehouse as they become available
Levels of data warehouses:
1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities,
p p ,
5. Becomes capable of making events happen
p g pp
Real time DW From DW to DM [JH]
Three kinds of data warehouse applications
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization t l
i i lt i i li ti tools.
[JH] Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, 2001.
[ET] Efraim Turban et al., Decision Support and Business
Intelligence Systems, Pearson, 2007.
[DO] David Olson and Yong Shi, Introduction to Business
Data Mining, McGraw-Hill, 2007.