2. A producer wants to know…
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
3. Lots of data everywhere,
yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from
one form to another
4. What is a Data Warehouse?
A single, complete and
consistent store of data
obtained from a variety of
different sources made
available to end users in
a way they can
understand and use in a
business context.
[Barry Devlin]
5. What users say...
• Data should be integrated
across the enterprise
• Summary data has a real
value to the organization
• Historical data holds the
key to understanding data
over time
• What-if capabilities are
required
6. What is Data Warehousing?
A process of transforming
data into information and
making it available to users
in a timely enough manner
to make a difference
[Forrester Research, April 1996]
7. Evolution
• 60’s: Batch reports
– hard to find and analyze information
– inflexible and expensive; every new request
requires reprogramming
• 70’s: Terminal-based DSS and EIS
(executive information systems)
– still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis
tools
– query tools, spreadsheets, GUIs
– easier to use, but only access operational
databases
• 90’s: Data warehousing with integrated
OLAP engines and tools
8. Warehouses are Very Large
Databases
[Bar chart: percentage of respondents (0% to 35%) by warehouse size, initial vs. projected 2Q96. Size buckets: 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB. Source: META Group, Inc.]
9. Very Large Data Bases
• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information
Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
10. Data Warehousing --
It is a process
• Technique for assembling and
managing data from various
sources for the purpose of
answering business questions,
thus enabling decisions that were
not previously possible
• A decision support database
maintained separately from the
organization’s operational
database
11. Data Warehouse
• A data warehouse is a
– subject-oriented
– integrated
– time-variant
– non-volatile
collection of data that is used primarily
in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
12. Data Warehouse Subject-oriented
Customers: Get
information of
different prices
of a beer
Farmers: Harvest
information from
known access
paths
13. Data Warehouse Subject-oriented
Students: Get
information about
various
universities in
U.K.
Explorers: Seek
out the unknown
and previously
unsuspected
rewards hiding in
the detailed data
14. Data Warehouse Subject-oriented
• Focusing on the modelling and
analysis of data for decision makers,
not on daily operations or transaction
processing
• Provide a simple and concise view
around particular subject issues by
excluding data that are not useful in the
decision support process
15. Data Warehouse Subject-oriented
[Diagram: the enterprise “database” (customers, orders, transactions, vendors, etc.) is copied, organized and summarized into the data warehouse, which feeds data mining. Data miners: “farmers” (they know) and “explorers” (unpredictable).]
17. Data Warehouse: Time-variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
– Operational database: current value data
• Every key structure in the data warehouse
– Contains an element of time explicitly or implicitly, while
the key of operational data may or may not contain “time
element”
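The difference between a current-value operational key and a time-stamped warehouse key can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names (op_balance, dw_balance), columns and dates are hypothetical and not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational table: one row per account, holding only the current value.
conn.execute(
    "CREATE TABLE op_balance (account_id INTEGER PRIMARY KEY, balance REAL)"
)

# Warehouse table: the key contains a time element, so history is kept.
conn.execute(
    "CREATE TABLE dw_balance ("
    " account_id INTEGER, snapshot_date TEXT, balance REAL,"
    " PRIMARY KEY (account_id, snapshot_date))"
)

snapshots = [("2024-01-31", 100.0), ("2024-02-29", 250.0), ("2024-03-31", 175.0)]
for day, amount in snapshots:
    # The operational row is overwritten in place ...
    conn.execute("INSERT OR REPLACE INTO op_balance VALUES (1, ?)", (amount,))
    # ... while the warehouse appends a new dated snapshot.
    conn.execute("INSERT INTO dw_balance VALUES (1, ?, ?)", (day, amount))

op_rows = conn.execute("SELECT balance FROM op_balance").fetchall()
dw_rows = conn.execute(
    "SELECT snapshot_date, balance FROM dw_balance ORDER BY snapshot_date"
).fetchall()
print(op_rows)  # only the latest balance survives
print(dw_rows)  # one row per snapshot date
```

The operational table answers "what is the balance now?", while the warehouse table can also answer "what was the balance at the end of January?".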
20. Data Mart
• A Data Mart is a smaller, more focused
Data Warehouse – a mini-warehouse.
• A Data Mart typically reflects the business
rules of a specific business unit within an
enterprise.
21. Data Warehouse to Data Mart
[Diagram: the Data Warehouse feeds three Data Marts; each Data Mart provides Decision Support Information.]
22. DATA MARTS
• Create many DM’s
• Limited scope
Examples:
1. Financial DM
2. Marketing DM
3. Supply chain DM
24. Transaction (Operational)
Data
• Operational (production) systems create
(massive number of) transactions, such
as sales, purchases, deposits,
withdrawals, returns, refunds, phone
calls, toll roads, web site “hits”, etc…
• Transactions are the base level of data –
the raw material for understanding
customer behavior
• Unfortunately, operational systems
change due to changing business needs
• Fortunately, operational systems can
usually be changed to support changing
business needs
• Data warehousing strategies need to be
aware of operational system changes
25. Operational Summary Data
Summaries are for a
specific time period
and utilize the
transaction data for
that time period
Other examples???
26. Decision Support Summary Data
• The data that are used to help make
decisions about the business
– Financial Data, such as:
• Income Statements (Profit & Loss)
• Balance Sheets (Assets – Liabilities = Net
Worth)
– Sales summaries
– Other examples???
• Data warehouses maintain this type of
data; however, financial data “of record”
(for audit purposes) usually comes
from the operational databases and not the data
warehouse (confusing???)
• Generally, it is a bad idea to use the
same system for analytic and
operational purposes
27. Data Warehouse for Decision
Support
• Putting information technology to work to help
the knowledge worker make faster and
better decisions
– Which of my customers are most likely to
go to the competition?
– What product promotions have the biggest
impact on revenue?
– How did the share price of software
companies correlate with profits over the last
10 years?
28. Decision Support
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can
be ad-hoc
• Used by managers and end-users to
understand the business and make
judgements
29. Database Schema
• Database schema defines the structure of data,
not the values of the data (e.g., first name, last
name = structure; Ron Norman = values of the
data)
• In RDBMS:
– Columns = fields = attributes (A,B,C)
– Rows = records = tuples (1-7)
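The distinction between structure and values can be seen by asking a database for each separately. The sketch below uses Python's built-in sqlite3 module; the table name person and its column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO person VALUES ('Ron', 'Norman')")

# Schema: the column names and types (structure, not values).
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(person)")]

# Values: the rows (tuples) stored under that structure.
values = conn.execute("SELECT first_name, last_name FROM person").fetchall()

print(schema)
print(values)
```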
31. Physical Database Schema
• Describes the data the way it will be stored in an
RDBMS, which might differ from the way the
logical schema shows it
32. Metadata
• General definition: Data about data !!!
– Examples:
• A library’s card catalog (metadata)
describes publications (data)
• A file system maintains permissions
(metadata) about files (data)
• A form of system documentation
including:
– Values legally allowed in a field (e.g., AZ,
CA, OR, UT, WA, etc.)
– Description of the contents of each field
(e.g., start date)
– Date when data were loaded
– Indication of currency of the data
(last updated)
– Mappings between systems
(e.g., A.this = B.that)
• Invaluable, otherwise have to
research to find it
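One way to picture such documentation is as a small record kept alongside each field. The sketch below is only an illustration: the FieldMetadata class, its attribute names, the dates and the mapping string are all hypothetical, and only the state codes come from the slide.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class FieldMetadata:
    """Data about one field: what it means and what it may contain."""
    name: str
    description: str
    allowed_values: Optional[FrozenSet[str]]  # None means unrestricted
    loaded_on: date          # date when the data were loaded
    last_updated: date       # indication of currency
    source_mapping: str      # mapping between systems


def is_legal(meta: FieldMetadata, value: str) -> bool:
    """Check a value against the legally allowed values, if any."""
    return meta.allowed_values is None or value in meta.allowed_values


state_meta = FieldMetadata(
    name="state",
    description="Two-letter state code of the customer address",
    allowed_values=frozenset({"AZ", "CA", "OR", "UT", "WA"}),
    loaded_on=date(2024, 1, 15),
    last_updated=date(2024, 3, 1),
    source_mapping="A.cust_state = B.state",
)

print(is_legal(state_meta, "CA"))
print(is_legal(state_meta, "ZZ"))
```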
33. Business Rules
• Highest level of abstraction from
operational (transaction) data
• Describes why relationships exist and
how they are applied
• Examples:
– Need to have 3 forms of ID for credit
– Only allow a maximum daily withdrawal of
$200
– After the 3rd log-in attempt, lock the log-in
screen
– Accept no bills larger than $20
– Others???
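Some of the example rules above could be written down as simple checks. This is a rough sketch, not how any particular system enforces them; the function names are hypothetical, and reading "3rd log-in attempt" as three failed attempts is an assumption.

```python
MAX_DAILY_WITHDRAWAL = 200   # dollars per day
MAX_LOGIN_ATTEMPTS = 3       # assumed to mean failed attempts
LARGEST_BILL_ACCEPTED = 20   # dollars


def withdrawal_allowed(withdrawn_today: float, requested: float) -> bool:
    """Only allow a maximum daily withdrawal of $200."""
    return withdrawn_today + requested <= MAX_DAILY_WITHDRAWAL


def login_locked(failed_attempts: int) -> bool:
    """After the 3rd log-in attempt, lock the log-in screen."""
    return failed_attempts >= MAX_LOGIN_ATTEMPTS


def bill_accepted(denomination: int) -> bool:
    """Accept no bills larger than $20."""
    return denomination <= LARGEST_BILL_ACCEPTED


print(withdrawal_allowed(150, 50))
print(login_locked(3))
print(bill_accepted(50))
```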
34. General Architecture for Data
Warehousing
• Source systems
• Extraction, (Clean),
Transformation, &
Load (ETL)
• Central repository
• Metadata repository
• Data marts
• Operational
feedback
• End users
(business)
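The path from source systems to the central repository can be sketched as a small pipeline. This is a toy illustration in plain Python: the function names, the record layout and the cleaning rule (drop rows with a missing amount) are assumptions, not part of the architecture described above.

```python
def extract(source_rows):
    """Extraction: copy raw records out of a source system."""
    return list(source_rows)


def clean(rows):
    """Cleaning: drop records whose amount is missing."""
    return [r for r in rows if r.get("amount") not in (None, "")]


def transform(rows):
    """Transformation: normalize names and convert amounts to numbers."""
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]


def load(rows, repository):
    """Load: append the prepared records to the central repository."""
    repository.extend(rows)
    return len(rows)


source_system = [
    {"customer": " alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": ""},        # rejected by clean()
    {"customer": "carol", "amount": "4"},
]
central_repository = []
loaded = load(transform(clean(extract(source_system))), central_repository)
print(loaded, central_repository)
```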
35. DATA WAREHOUSE SCOPE
• Broad: required for companies; very costly; may
be divided according to departments
• Narrow: required for personal information
36. Design of a Data Warehouse: A
Business Analysis Framework
• Four views regarding the design of a data
warehouse
– Top-down view
• allows selection of the relevant information necessary for
the data warehouse
– Data source view
• exposes the information being captured, stored, and
managed by operational systems
– Data warehouse view
• consists of fact tables and dimension tables
– Business query view
• sees the perspectives of data in the warehouse from the
view of end-user
37. Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
– Top-down: Starts with overall design and planning
– Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
– Waterfall: structured and systematic analysis at each step before
proceeding to the next
– Spiral: rapid generation of increasingly functional systems, with short
turnaround time
• Typical data warehouse design process
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measure that will populate each fact table record
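The four design steps can be illustrated with a tiny star schema. The sketch below uses Python's built-in sqlite3 module; the business process (orders), the grain (one row per order line), the two dimensions (date, product) and the measure (amount) are all example choices, and the table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions that apply to each fact table record.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

-- Fact table: grain is one row per order line; the measure is amount.
CREATE TABLE fact_order_line (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL
);

INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO dim_product VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO fact_order_line VALUES (1, 10, 5.0), (1, 20, 7.5), (2, 10, 2.5);
""")

# A typical query: aggregate the measure along one dimension.
revenue_by_month = conn.execute("""
    SELECT d.month, SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_date AS d USING (date_key)
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(revenue_by_month)
```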
38. Multi-Tiered Architecture
[Diagram, four tiers from left to right:
Data Sources: operational DBs and other sources, fed through extract, transform, load and refresh
Data Storage: data warehouse and data marts, with a monitor & integrator and metadata
OLAP Engine: OLAP server, which serves the front end
Front-End Tools: analysis, query, reports, data mining]
39. Three Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning
the entire organization
• Data Mart
– a subset of corporate-wide data that is of value to
specific groups of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data
mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
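The difference between a virtual summary view and a materialized one can be sketched as follows, using Python's built-in sqlite3 module. SQLite has no materialized views, so a plain table filled from the view stands in for one here; the table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'North', 10.0), (2, 'South', 4.0), (3, 'North', 6.0);

-- Virtual: a view, recomputed from the operational data on every query.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- "Materialized": a stored copy of the view's result at this moment.
CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;

-- A later operational update.
INSERT INTO orders VALUES (4, 'South', 1.0);
""")

virtual = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
materialized = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region"
).fetchall()
print(virtual)       # reflects the new order
print(materialized)  # stale until it is refreshed
```

The view is always current but puts its query load on the operational database; the stored copy is cheap to read but must be refreshed.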
40. Data Mining works with
Warehouse Data
• Data Warehousing provides
the Enterprise with a
memory
• Data Mining provides the
Enterprise with intelligence
41. We want to know ...
• Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
• Which types of transactions are likely to be fraudulent given
the demographics and transactional history of a particular
customer?
• If I raise the price of my product by Rs. 2, what is the effect on
my ROI?
• If I offer only 2,500 airline miles as an incentive to purchase
rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its
technical capabilities, what will be the net effect on my
revenues?
• Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
42. Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value added data
Utilities | Power usage analysis
43. Data Mining in Use
• Data Mining can be used to track fraud
• A Supermarket becomes an information
broker
• Basketball teams use it to track game strategy
• Cross Selling
• Warranty Claims Routing
• Holding on to Good Customers
• Weeding out Bad Customers
45. Operational Systems
• Run the business in real time
• Based on up-to-the-second
data
• Optimized to handle large
numbers of simple read/write
transactions
• Optimized for fast response
to predefined transactions
• Used by people who deal with
customers, products --
clerks, salespeople etc.
• They are increasingly used
by customers
46. Online Transaction Processing
(OLTP)
It refers to a class of
systems that facilitate
and manage
transaction-oriented
applications, typically
for data entry and
retrieval transaction
processing
47. Online Transaction Processing
(OLTP)
OLTP technology is used in a
number of industries, including
banking, airlines, mail order,
supermarkets, and manufacturing.
Applications include electronic
banking, order processing,
employee time clock systems,
e-commerce, and eTrading. The
most widely used OLTP system is
probably IBM's CICS.
48. What are Operational Systems?
• They are OLTP systems
• Run mission critical
applications
• Need to work with
stringent performance
requirements for routine
tasks
• Used to run a business!
49. RDBMS used for OLTP
• Database Systems have been
used traditionally for OLTP
– clerical data processing tasks
– detailed, up to date data
– structured repetitive tasks
– read/update a few records
– isolation, recovery and
integrity are critical
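A typical OLTP task touches a few records and must either complete fully or not at all. The sketch below shows this with Python's built-in sqlite3 module; the account table, the transfer function and the non-negative balance rule are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance REAL CHECK (balance >= 0))"  # integrity rule
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


ok = transfer(conn, 1, 2, 30.0)
rejected = transfer(conn, 1, 2, 500.0)
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, rejected, balances)
```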
50. Operational Summary Data
Summaries are for a
specific time period
and utilize the
transaction data for
that time period
Other examples???
51. Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
53. Application-Orientation vs.
Subject-Orientation
• Application-orientation (operational database)
– Loans
– Credit Card
– Trust
– Savings
• Subject-orientation (data warehouse)
– Customer
– Vendor
– Product
– Activity
54. OLTP vs. Data Warehouse
• OLTP systems are tuned for known transactions
and workloads while workload is not known a
priori in a data warehouse
• Special data organization, access methods and
implementation methods are needed to support
data warehouse queries (typically
multidimensional queries)
– e.g., average amount spent on phone calls between
9AM-5PM in Pune during the month of December
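The example query above slices the data along three dimensions at once: location, time of day and month. A minimal sketch of it with Python's built-in sqlite3 module is shown here; the call_fact table, its columns and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call_fact (city TEXT, call_start TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO call_fact VALUES (?, ?, ?)",
    [
        ("Pune", "2023-12-04 10:15:00", 12.0),
        ("Pune", "2023-12-04 16:30:00", 8.0),
        ("Pune", "2023-12-04 21:00:00", 30.0),    # outside 9AM-5PM
        ("Mumbai", "2023-12-05 11:00:00", 50.0),  # different city
        ("Pune", "2023-11-20 12:00:00", 99.0),    # different month
    ],
)

# Average amount spent on calls starting between 09:00 and 16:59
# in Pune during December.
avg_amount = conn.execute("""
    SELECT AVG(amount)
    FROM call_fact
    WHERE city = 'Pune'
      AND strftime('%m', call_start) = '12'
      AND strftime('%H', call_start) BETWEEN '09' AND '16'
""").fetchone()[0]
print(avg_amount)
```

On an operational call-record system a query like this would scan a large share of the table, which is one reason it is better served by a warehouse.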
55. OLTP vs Data Warehouse
• OLTP
– Application oriented
– Used to run business
– Detailed data
– Current, up to date
– Isolated data
– Repetitive access
– Clerical user
• Warehouse (DSS)
– Subject oriented
– Used to analyze business
– Summarized and refined
– Snapshot data
– Integrated data
– Ad-hoc access
– Knowledge user (manager)
56. OLTP vs Data Warehouse
• OLTP
– Performance sensitive
– Few records accessed at a time (tens)
– Read/update access
– No data redundancy
– Database size 100 MB - 100 GB
• Data Warehouse
– Performance relaxed
– Large volumes accessed at a time (millions)
– Mostly read (batch update)
– Redundancy present
– Database size 100 GB - few terabytes
57. OLTP vs Data Warehouse
• OLTP
– Transaction throughput is the performance metric
– Thousands of users
– Managed in entirety
• Data Warehouse
– Query throughput is the performance metric
– Hundreds of users
– Managed by subsets
58. To summarize ...
• OLTP Systems are
used to “run” a
business
• The Data Warehouse
helps to “optimize” the
business
59. Why Separate Data
Warehouse?
• Performance
– Operational databases are designed and tuned for known transactions and workloads.
– Complex OLAP queries would degrade performance for operational transactions.
– Special data organization, access and implementation methods are
needed for multidimensional views and queries.
• Function
– Missing data: Decision support requires historical data, which operational
databases do not typically maintain.
– Data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many heterogeneous
sources: operational databases, external sources.
– Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to be reconciled.
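Reconciling inconsistent codes during consolidation might look like the sketch below. The two sources, the gender field and the code table are hypothetical; a real warehouse would take such mappings from its metadata.

```python
# Hypothetical mapping from the codes used by different sources
# to one warehouse-wide representation.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}


def reconcile_gender(raw):
    """Map a source-specific code to the warehouse code."""
    return GENDER_CODES.get(str(raw).strip().lower(), "UNKNOWN")


source_a = [{"id": 1, "gender": "male"}, {"id": 2, "gender": "F"}]
source_b = [{"id": 3, "gender": 1}, {"id": 4, "gender": "x"}]

# Consolidate both sources into one consistent collection.
consolidated = [
    {"id": r["id"], "gender": reconcile_gender(r["gender"])}
    for r in source_a + source_b
]
print(consolidated)
```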
60. INFORMATION SYSTEMS
• Designed to support
decision-making based on
1. Historical data
2. Prediction data.
• Designed for complex queries
or data-mining applications.
Examples:
1. Sales trend analysis,
2. Customer segmentation
3. Human resources planning
62. DIFFERENCE
Characteristics | Operational Systems | Informational Systems
Purpose | Real-time data entry | Read and analyze historical data
Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis
Design goal | Performance: throughput, availability | Ease of flexible access and use
Volume | Many, constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows