2. DATA WAREHOUSING INTRODUCTION
• A data warehouse is a central repository of information that can be
analyzed to make more informed decisions.
3. THE WAREHOUSING APPROACH
[Figure: the warehousing approach. Clients query the Data Warehouse. An Integration System, with its Metadata, loads the warehouse from several Sources, each behind its own Extractor/Monitor.]
• Information integrated in advance
• Stored in warehouse for direct querying and analysis
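The flow on this slide (an extractor per source, an integration system, and a warehouse that clients query directly) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their field names, and the helper functions are all hypothetical and do not come from the slides.

```python
# Hypothetical sources with different schemas (illustrative data only).
source_a = [{"cust": "ann", "amount_usd": 10.0}, {"cust": "bob", "amount_usd": 5.0}]
source_b = [{"customer_name": "Ann", "cents": 250}]


def extract_a(rows):
    """Extractor for source A: map its fields onto a common schema."""
    return [{"customer": r["cust"].lower(), "amount": r["amount_usd"]} for r in rows]


def extract_b(rows):
    """Extractor for source B: normalize names and convert cents to dollars."""
    return [{"customer": r["customer_name"].lower(), "amount": r["cents"] / 100}
            for r in rows]


def integrate(*batches):
    """Integration system: merge extracted batches and keep simple metadata."""
    warehouse = {"facts": [], "metadata": {"row_counts": []}}
    for batch in batches:
        warehouse["facts"].extend(batch)
        warehouse["metadata"]["row_counts"].append(len(batch))
    return warehouse


def total_by_customer(warehouse):
    """Client query: runs against the warehouse copy, never against the sources."""
    totals = {}
    for row in warehouse["facts"]:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


# Integration happens in advance; the query afterwards touches only the warehouse.
warehouse = integrate(extract_a(source_a), extract_b(source_b))
totals = total_by_customer(warehouse)
print(totals)
```

Because the data is copied and integrated ahead of time, the query does no work at the sources, which is the trade-off the next slide lists: fast queries, but not necessarily the most current information.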
4. ADVANTAGES OF WAREHOUSING APPROACH
• High query performance
  • But not necessarily most current information
• Doesn’t interfere with local processing at sources
  • Complex queries at warehouse
  • OLTP at information sources
• Information copied at warehouse
  • Can modify, annotate, summarize, restructure, etc.
  • Can store historical information
• Security, auditing
• Has caught on in industry
5. DATA WAREHOUSE EVOLUTION
[Figure: timeline of data warehouse evolution, 1960 to 2000. Eras shown: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management. Milestones shown: relational databases, PCs and spreadsheets, end-user interfaces, data replication tools, 1st DW article, “Building the DW” (Inmon, 1992), DW conferences, vendor DW frameworks, company DWs.]
6. WHAT IS A DATA WAREHOUSE? A PRACTITIONER’S VIEWPOINT
“A data warehouse is simply a single, complete, and consistent
store of data obtained from a variety of sources and made
available to end users in a way they can understand and use it in a
business context.”
-- Barry Devlin, IBM Consultant
7. A DATA WAREHOUSE IS...
• Stored collection of diverse data
  • A solution to data integration problem
  • Single repository of information
• Subject-oriented
  • Organized by subject, not by application
  • Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executive
8. A DATA WAREHOUSE IS... (CONTINUED)
• Large volume of data (GB, TB)
• Non-volatile
  • Historical
  • Time attributes are important
• Updates infrequent
  • May be append-only
• Examples
  • All transactions ever at WalMart
  • Complete client histories at insurance firm
  • Stockbroker financial information and portfolios
10. WAREHOUSE IS A SPECIALIZED DB
Standard DB | Warehouse
Mostly updates | Mostly reads
Many small transactions | Queries are long and complex
MB - GB of data | GB - TB of data
Current snapshot | History
Index/hash on primary key | Lots of scans
Raw data | Summarized, reconciled data
Thousands of users (e.g., clerical users) | Hundreds of users (e.g., decision-makers, analysts)
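The contrast between the two workloads can be made concrete with Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions: the `account` and `sales_history` tables and their rows are hypothetical, and a single in-memory SQLite database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table (current snapshot, keyed access).
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# Hypothetical warehouse table (history, read mostly).
conn.execute("CREATE TABLE sales_history (year INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [(1998, "east", 10.0), (1998, "west", 20.0),
     (1999, "east", 30.0), (1999, "west", 5.0)],
)

# Standard-DB style: a small transaction that updates one row by primary key.
conn.execute("UPDATE account SET balance = balance - 25.0 WHERE id = ?", (1,))
balance = conn.execute(
    "SELECT balance FROM account WHERE id = ?", (1,)
).fetchone()[0]

# Warehouse style: a read-only query that scans the history and summarizes it.
yearly = conn.execute(
    "SELECT year, SUM(amount) FROM sales_history GROUP BY year ORDER BY year"
).fetchall()

print(balance)
print(yearly)
conn.close()
```

The first statement benefits from an index on the primary key; the second has to read every row of the history, which is why warehouses are tuned for scans rather than point lookups.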
11. DATA WAREHOUSE ARCHITECTURES: CONCEPTUAL VIEW
• Single-layer
  • Every data element is stored once only
  • Virtual warehouse
• Two-layer
  • Real-time + derived data
  • Most commonly used approach in industry today
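[Figure: single-layer architecture, in which operational and informational systems both work on “real-time data”; two-layer architecture, in which operational systems work on real-time data and informational systems work on derived data.]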
12. THREE-LAYER ARCHITECTURE: CONCEPTUAL VIEW
• Transformation of real-time data to derived data really requires two steps
[Figure: three-layer architecture. Layers between operational systems and informational systems: real-time data, reconciled data, derived data. Labels: “Physical Implementation of the Data Warehouse”; “View level: particular informational needs”.]
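The two steps named on this slide, from real-time data to reconciled data and from reconciled data to derived data, can be sketched as two small functions. The records, field names, and the choice of a per-customer total as the "particular informational need" are hypothetical, picked only to make the layering visible.

```python
# Layer 1: real-time data as it sits in two hypothetical operational systems.
real_time = [
    {"system": "orders", "cust_id": "C1", "total": "19.50"},
    {"system": "billing", "customer": "c1", "total": "5.50"},
    {"system": "orders", "cust_id": "C2", "total": "7.00"},
]


def reconcile(records):
    """Step 1: unify keys and types so that all systems agree on one schema."""
    reconciled = []
    for r in records:
        key = r.get("cust_id") or r.get("customer")
        reconciled.append({"customer": key.upper(), "total": float(r["total"])})
    return reconciled


def derive(reconciled):
    """Step 2: summarize reconciled data for one informational need."""
    summary = {}
    for r in reconciled:
        summary[r["customer"]] = summary.get(r["customer"], 0.0) + r["total"]
    return summary


reconciled = reconcile(real_time)   # Layer 2: reconciled data
derived = derive(reconciled)        # Layer 3: derived data
print(derived)
```

Keeping the reconciled layer separate means several different derived views can be built from the same cleaned data without repeating the reconciliation work.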
14. WHAT IS DATA MINING?
• Data Mining is:
• (1) The efficient discovery of previously unknown, valid,
potentially useful, understandable patterns in large datasets
• (2) The analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in
novel ways that are both understandable and useful to the
data owner
15. OVERVIEW OF TERMS
• Data: a set of facts (items) D, usually stored in a database
• Pattern: an expression E in a language L, that describes a subset
of facts
• Attribute: a field in an item I in D.
• Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M
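These definitions can be made concrete with a toy example. In the sketch below the language L is simply "Python predicates over an item", the measure space M is the interval [0, 1], and interestingness is taken to be the fraction of facts the pattern describes. That choice of measure is an assumption made for illustration; the slides do not fix a particular one, and the data is invented.

```python
# Data D: a set of facts (items); each key of an item is an attribute.
D = [
    {"age": 25, "buys": True},
    {"age": 40, "buys": False},
    {"age": 31, "buys": True},
    {"age": 22, "buys": True},
]


def pattern(item):
    """Expression E: 'customers under 35 who buy'. It describes a subset of D."""
    return item["age"] < 35 and item["buys"]


def subset_described(data, expr):
    """The subset of facts in the data that the expression describes."""
    return [item for item in data if expr(item)]


def interestingness(data, expr):
    """Assumed measure: fraction of facts described, a value in M = [0, 1]."""
    return len(subset_described(data, expr)) / len(data)


score = interestingness(D, pattern)
print(score)
```

Other measures (novelty, confidence, statistical significance) fit the same shape: a function of the data and the expression that returns a value in some measure space.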
17. EXAMPLES OF LARGE DATASETS
• Government: IRS, NGA, …
• Large corporations
  • WALMART: 20M transactions per day
  • MOBIL: 100 TB geological databases
  • AT&T: 300M calls per day
  • Credit card companies
• Scientific
  • NASA, EOS project: 50 GB per hour
  • Environmental datasets
18. EXAMPLES OF DATA MINING APPLICATIONS
1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology
19. THE DATA MINING PROCESS
1. Understand the domain
2. Create a dataset:
   • Select the interesting attributes
   • Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2
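Steps 2 to 4 of this process can be walked through on a toy example. The sketch assumes a hypothetical set of shopping baskets and picks pair co-occurrence counting as the mining task; both the data and the choice of task are illustrative and not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records; "cashier" is an attribute we decide is not interesting.
raw = [
    {"id": 1, "items": ["milk", "bread"], "cashier": "x"},
    {"id": 2, "items": ["milk", "bread", "eggs"], "cashier": "y"},
    {"id": 3, "items": None, "cashier": "x"},  # dirty record with no items
    {"id": 4, "items": ["milk", "bread", "jam"], "cashier": "z"},
]

# Step 2a: select the interesting attribute.
selected = [record["items"] for record in raw]

# Step 2b: clean and preprocess (drop empty records, deduplicate, sort items).
baskets = [sorted(set(items)) for items in selected if items]

# Step 3: chosen task and algorithm: count how often each item pair co-occurs.
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(basket, 2)
)

# Step 4: interpret the result; an unconvincing answer would send us back to step 2.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Step 1, understanding the domain, is what justifies the decisions made in code here, such as ignoring the cashier attribute and treating a record with no items as noise.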
20. HOW DATA MINING IS USED
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results