Data Warehousing and OLAP
Lecture 2/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia

Objectives
Motivation: Why data warehouse?
What is a data warehouse?
Why separate DW?
Conceptual modeling of DW
Data Mart
Data Warehousing Architectures
Data Warehouse Development
Data Warehouse Vendors
Real-time DW

Motivation: Why data warehouse?
Construction of data warehouses (DW) involves data cleaning and data integration, an important preprocessing step for data mining (DM).
DW provide OLAP for the interactive analysis of multidimensional data, which facilitates effective DM.
Data mining functions can be integrated with OLAP operations to enhance interactive mining of knowledge.
DW will provide an effective platform for DM.
While DWs are not a requirement for DM, DW store massive amounts of data that can be used for DM. [DO]

What is a data warehouse? [JH]
Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization's operational database (ODB).
  Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Case Study 2: Continental Airlines flies high with its real-time data warehouse

What is a data warehouse? [ET]
Data warehouse
  A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Characteristics
  Subject oriented, Integrated, Time Variant, Non-volatile
  Web-based, Relational/multidimensional, Client/server, Real-time
  Include metadata
Data warehousing
  Process of constructing and using data warehouses.
  Requires data integration, data cleaning, and data consolidation.

Subject Oriented
Organized around major subjects, such as customer, product, sales.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Integrated
Integrate multiple, heterogeneous data sources
  Relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., Hotel price: currency, tax, breakfast covered, etc.
  When data is moved to the warehouse, it is converted.

Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
  Operational database: current value data.
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a "time element".
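
The hotel price example on the Integrated slide can be sketched as a small conversion step. This is only an illustration under stated assumptions: the record layout, the warehouse convention (USD, tax included) and the exchange rates are hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; a real warehouse would read these
# from a maintained reference table.
RATES_TO_USD = {"USD": 1.0, "IDR": 0.0001}


@dataclass
class SourcePrice:
    amount: float        # price as recorded by the source system
    currency: str        # currency code used by the source
    tax_included: bool   # whether the amount already includes tax
    tax_rate: float      # tax rate used by the source, e.g. 0.10


def to_warehouse_price(p: SourcePrice) -> float:
    """Convert a source price to the assumed warehouse convention: USD, tax included."""
    gross = p.amount if p.tax_included else p.amount * (1 + p.tax_rate)
    return round(gross * RATES_TO_USD[p.currency], 2)


# Two sources describing the same room rate in different conventions.
source_a = SourcePrice(amount=1_000_000, currency="IDR", tax_included=False, tax_rate=0.10)
source_b = SourcePrice(amount=110.0, currency="USD", tax_included=True, tax_rate=0.10)

print(to_warehouse_price(source_a), to_warehouse_price(source_b))
```

Once both records follow the same convention they can be compared or aggregated directly, which is the point of converting data as it moves into the warehouse.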

Non-volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
  Build wrappers/mediators on top of multiple, heterogeneous databases.
  Ex: IBM Data Joiner, Informix DataBlade
Query-driven approach:
  When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set.
  Requires complex information filtering and integration processes, and competes for resources at the local sources.
  Inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.

Data Warehouse vs. Heterogeneous DBMS (2)
Using DW: update-driven approach
  Information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.
  Unlike OLTP, DW do not contain the most current information.
  DW brings high performance to the integrated heterogeneous DB system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one data store.
  Query processing in DW does not interfere with the processing at local sources.
  DW can store and integrate historical information and support complex multidimensional queries.

DW vs. ODB
Major task of ODB: OLTP
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
DW serve for data analysis and decision making: OLAP
Distinct Features (OLTP vs. OLAP)
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries
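
The difference between the query-driven and update-driven approaches can be sketched in a few lines. This is a toy illustration, not a real mediator or warehouse: the two in-memory sources, their field names and the function names are all hypothetical.

```python
from collections import defaultdict

# Two hypothetical sources that record the same kind of fact
# under different field names.
SOURCE_A = [{"prod": "TV", "amt": 100}, {"prod": "PC", "amt": 250}]
SOURCE_B = [{"item": "TV", "total": 40}, {"item": "VCR", "total": 60}]


def query_driven_total(product):
    """Query-driven: translate the query per source at query time, then merge."""
    from_a = sum(r["amt"] for r in SOURCE_A if r["prod"] == product)
    from_b = sum(r["total"] for r in SOURCE_B if r["item"] == product)
    return from_a + from_b


def build_warehouse():
    """Update-driven: integrate and summarize the sources once, in advance."""
    warehouse = defaultdict(int)
    for r in SOURCE_A:
        warehouse[r["prod"]] += r["amt"]
    for r in SOURCE_B:
        warehouse[r["item"]] += r["total"]
    return dict(warehouse)


WAREHOUSE = build_warehouse()


def update_driven_total(product):
    """Answer from the pre-integrated store without touching the sources."""
    return WAREHOUSE.get(product, 0)


print(query_driven_total("TV"), update_driven_total("TV"))
```

Both functions return the same total, but the query-driven version scans every source on every call, while the update-driven version pays the integration cost once and may serve slightly stale data until the warehouse is refreshed.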

OLTP vs OLAP
                    OLTP                             OLAP
users               Clerk, IT professional           Knowledge worker
function            day to day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date,             historical,
                    detailed, flat relational,       summarized, multidimensional,
                    isolated                         integrated, consolidated
usage               repetitive                       ad-hoc
access              read/write,                      lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100MB-GB                         100GB-TB
metric              transaction throughput           query throughput, response

Why Separate DW?
High performance for both systems:
  DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  Warehouse: tuned for OLAP (complex OLAP queries, computation of large groups of data at summarized levels, multidimensional view, consolidation)
Processing OLAP queries in operational databases would degrade the performance of operational tasks.
In ODB, concurrency control and recovery mechanisms (locking, logging) are required to ensure the consistency and robustness of transactions.
OLAP: read-only access. No need for concurrency control and recovery.

Why Separate DW? (2)
Different functions and different data:
  Missing data: decision support requires historical data which operational DBs do not typically maintain. So, data in ODB is usually far from complete for decision making.
  Data consolidation: decision support (DS) requires consolidation (aggregation, summarization) of data from heterogeneous sources. ODB contain detailed raw data (transactions) which need to be consolidated before analysis.
  Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

Conceptual Modeling of DW
Data Cube: see TSBD Lecture Notes on Visualization of Data Cubes
Modeling data warehouses: dimensions & measurements
  Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables, one for each dimension).
  Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
  Fact constellations: Multiple fact tables share dimension tables. Also known as galaxy schema.

Example of Star Schema
[Figure: a central Sales Fact Table linked to four dimension tables]
  Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, Yen_sales
  Date: Day, Month, Year
  Product: ProductNo, ProdName, ProdDesc, Category, QOH
  Store: StoreID, City, State, Country, Region
  Cust: CustId, CustName, CustCity, CustCountry
  Note: potential redundancy, e.g. Bandung and Bogor are both in West Java (Jawa Barat).

Snowflake Schema
[Figure: the same fact table, with the Date and Store dimensions normalized into hierarchies]
  Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, Yen_sales
  Date: Day, Month; Month: Month, Year; Year: Year
  Product: ProductNo, ProdName, ProdDesc, Category, QOH
  Store: StoreID, City; City: City, State; State: State, Country; Country: Country, Region
  Cust: CustId, CustName, CustCity, CustCountry
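
A star schema like the one in the figure can be tried out with Python's built-in sqlite3 module. The sketch below is a simplified assumption, not the lecture's own schema: it omits the customer dimension and several attributes, uses snake_case column names, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE product  (product_no INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the measurements.
CREATE TABLE sales_fact (
    date_id      INTEGER REFERENCES date_dim(date_id),
    product_no   INTEGER REFERENCES product(product_no),
    store_id     INTEGER REFERENCES store(store_id),
    unit_sales   INTEGER,
    dollar_sales REAL
);
""")

# Invented sample rows. The state and country repeat for every store,
# which is the redundancy a snowflake schema would normalize away.
conn.executemany("INSERT INTO date_dim VALUES (?,?,?,?)",
                 [(1, 5, 1, 2007), (2, 9, 2, 2007)])
conn.executemany("INSERT INTO product VALUES (?,?,?)",
                 [(10, "TV", "Electronics"), (11, "PC", "Electronics")])
conn.executemany("INSERT INTO store VALUES (?,?,?,?)",
                 [(100, "Bandung", "Jawa Barat", "Indonesia"),
                  (101, "Bogor", "Jawa Barat", "Indonesia")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [(1, 10, 100, 2, 600.0),
                  (1, 11, 101, 1, 900.0),
                  (2, 10, 101, 3, 900.0)])

# A typical star join: the fact table joined to two dimensions, then aggregated.
rows = conn.execute("""
    SELECT s.state, p.prod_name, SUM(f.unit_sales), SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN store s   ON s.store_id = f.store_id
    JOIN product p ON p.product_no = f.product_no
    GROUP BY s.state, p.prod_name
    ORDER BY p.prod_name
""").fetchall()
print(rows)
```

Each query joins the fact table to only the dimensions it needs. In a snowflake variant, the store table would keep just the city and point to separate city, state and country tables, at the cost of more joins.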

View of Warehouses and Hierarchies
[Figure: tool screenshot showing the steps Importing data, Table browsing, Dimension creation, Dimension browsing, Cube building, Cube browsing]

Data Cube
[Figure: a 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
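
The "sum" cells of the cube figure can be computed by aggregating each base cell into every combination of kept and summed dimensions. The sketch below assumes three dimensions as in the figure; the sales values and the build_cube helper are hypothetical.

```python
from collections import defaultdict
from itertools import product as cartesian

ALL = "sum"  # marker for an aggregated dimension, like the figure's "sum" cells

# Hypothetical base cells: (quarter, product, country) -> sales
facts = {
    ("1Qtr", "TV", "U.S.A."): 10,
    ("2Qtr", "TV", "U.S.A."): 20,
    ("1Qtr", "PC", "Canada"): 5,
    ("3Qtr", "TV", "Mexico"): 7,
}


def build_cube(base):
    """Add each base cell to all 2**n combinations of kept/summed dimensions."""
    cube = defaultdict(int)
    for key, value in base.items():
        for mask in cartesian([False, True], repeat=len(key)):
            cell = tuple(ALL if summed else k for k, summed in zip(key, mask))
            cube[cell] += value
    return dict(cube)


cube = build_cube(facts)
print(cube[(ALL, "TV", "U.S.A.")])  # total annual sales of TV in U.S.A.
print(cube[(ALL, ALL, ALL)])        # grand total over all three dimensions
```

Materializing every cell like this trades storage for query speed; real OLAP servers usually precompute only a subset of the aggregates.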

Data Cube
[Figure: data cube browser illustrating Visualization, OLAP capabilities, Interactive manipulation]

Typical OLAP Operations
Roll up (drill-up): summarize data
  by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice:
  project and select
Pivot (rotate):
  reorient the cube, visualization, 3D to series of 2D planes.
Other operations
  drill across: involving (across) more than one fact table.
  drill through: through the bottom level to its back-end relational tables.
More info: www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org
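
The operations listed above can be illustrated on a tiny in-memory fact list. This is a minimal sketch with invented rows and hypothetical function names; an OLAP server would run these against a cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, city, product, units)
ROWS = [
    ("2007-01", "Bandung", "TV", 2),
    ("2007-02", "Bandung", "PC", 1),
    ("2007-02", "Bogor", "TV", 3),
    ("2007-04", "Bogor", "TV", 4),
]


def quarter_of(month):
    """Climb the time hierarchy one level: month -> quarter."""
    year, m = month.split("-")
    return f"{year}-Q{(int(m) - 1) // 3 + 1}"


def roll_up(rows):
    """Roll up: summarize months into quarters and drop the city dimension."""
    out = defaultdict(int)
    for month, _city, product, units in rows:
        out[(quarter_of(month), product)] += units
    return dict(out)


def slice_(rows, product):
    """Slice: select on a single dimension (here, product)."""
    return [r for r in rows if r[2] == product]


def dice(rows, cities, products):
    """Dice: select on two or more dimensions at once."""
    return [r for r in rows if r[1] in cities and r[2] in products]


def pivot(rows):
    """Pivot: reorient the data into a 2-D city-by-product view."""
    table = defaultdict(lambda: defaultdict(int))
    for _month, city, product, units in rows:
        table[city][product] += units
    return {city: dict(cols) for city, cols in table.items()}


print(roll_up(ROWS))
print(pivot(slice_(ROWS, "TV")))
```

Drill down is the reverse direction: going from the quarterly totals returned by roll_up back to the monthly, per-city rows in ROWS.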

Data Mart
DW collects information about subjects that span the entire organization, such as customers, products, sales, assets, and personnel. Its scope is enterprise-wide.
  For DW, fact constellation schema is commonly used since it can model multiple, interrelated subjects.
Data Mart is a subset of a DW that focuses on a particular subject. Its scope is department-wide. Typically, a data mart consists of a single subject area (e.g., marketing, operations).
  For Data Mart, star or snowflake schema are commonly used since both are geared towards modeling single subjects, although the star schema is more popular.

Data Mart (2)
A data mart can be either dependent or independent.
A dependent data mart is a subset that is created directly from the DW.
  Consistent data model
  Providing quality data
  DW must be constructed first
  Ensures that the user is viewing the same version of the data that is accessed by all other data warehouse users
An independent data mart is a small warehouse designed for a department, and its source is not an enterprise data warehouse (EDW).

Data Warehousing Process Overview
The major components of a data warehousing process
  Data sources
    Legacy systems, external data providers (e.g. BPS), OLTP, ERP Systems
  Data extraction
  Data loading
  Comprehensive database
  Metadata
  Middleware tools

Data Warehousing Architectures
[Figures: data warehousing architecture diagrams]

Data Integration and the ETL Process
Various integration technologies:
  Enterprise Application Integration (EAI)
    A technology that provides a vehicle for pushing data from source systems into a data warehouse
    Integrates application functionality and is focused on sharing functionality across systems
    Traditionally, API. Nowadays, SOA (web services).
  Enterprise Information Integration (EII)
    An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases
    A mechanism for pulling data from source systems to satisfy a request for information.

Data Integration and the ETL Process (2)
ETL
  Takes 60-70% of the time in a data-centric project.
  Extraction: Reading data from one or more databases
  Transformation: Converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a DW
  Load: Putting the data into the DW
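
The three ETL steps can be sketched with two in-memory SQLite databases standing in for an operational source and the warehouse. Table names, columns and the cleaning rules below are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical operational source with inconsistent city spellings
# and amounts stored in cents.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, city TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?,?,?)",
                   [(1, " bandung", 150000), (2, "BOGOR ", 99900), (3, "Bandung", 50000)])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (order_id INTEGER, city TEXT, amount REAL)")


def extract(conn):
    """Extraction: read rows from the source database."""
    return conn.execute("SELECT id, city, amount_cents FROM orders").fetchall()


def transform(rows):
    """Transformation: clean city names and convert cents to currency units."""
    return [(oid, city.strip().title(), cents / 100) for oid, city, cents in rows]


def load(conn, rows):
    """Load: put the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO sales VALUES (?,?,?)", rows)
    conn.commit()


load(warehouse, transform(extract(source)))

totals = warehouse.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city").fetchall()
print(totals)
```

Without the transformation step, " bandung" and "Bandung" would be grouped as different cities, which is the kind of inconsistency ETL is meant to remove before analysis.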

Data Warehouse Development
Direct benefits
  Allowing end users to perform extensive analysis in numerous ways
  A consolidated view of corporate data (i.e., a single version of the truth)
  Better and more timely information
  Enhanced system performance. DW frees production processing because some operational system reporting requirements are moved to DSS
  Simplification of data access

Data Warehouse Development (2)
Some best practices for implementing a DW (Weir, 2002):
  Project must fit with corporate strategy and business objectives
  There must be complete buy-in to the project by executives, managers, and users
  It is important to manage user expectations about the completed project
  The data warehouse must be built incrementally
  Build in adaptability
  Managed by both IT and business professionals
  Develop a business/supplier relationship
  Only load data that have been cleansed and are of a quality understood by the organization
  Do not overlook training requirements
  Be politically aware

Data Warehouse Vendors
Computer Associates
DataMirror
Data Advantage Group
Dell Computer
Embarcadero Technologies
Business Objects
HP
Hummingbird
Hyperion
IBM
Informatica
Microsoft
Oracle
SAS
Siemens
Sybase
Teradata
Please visit:
  Data Warehousing Institute (tdwi.com)
  DM Review (dmreview.com)

Data Warehouse Vendors (2)
Six guidelines to consider when developing a vendor list:
1. Financial strength
2. ERP linkages
3. Qualified consultants
4. Market share
5. Industry experience
6. Established partnerships

Real-time DW
Traditionally, updated on a weekly basis.
  Unsuitable for some businesses.
Real-time (active) data warehousing
  The process of loading and providing data via a data warehouse as they become available
Levels of data warehouses:
1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities
4. Operationalization
5. Becomes capable of making events happen

From DW to DM [JH]
Three kinds of data warehouse applications
  Information processing
    supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  Analytical processing
    multidimensional analysis of data warehouse data
    supports basic OLAP operations, slice-dice, drilling, pivoting
  Data mining
    knowledge discovery from hidden patterns
    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

References
[JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[ET] Efraim Turban et al., Decision Support and Business Intelligence Systems, Pearson, 2007.
[DO] David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.