Lecture Thirteen: Data Warehousing, OLAP, and Data Mining
• Required: Connolly and Begg, sections 25.1, 26.1.1, 26.2.1, and
• Elective: Connolly and Begg, the remainder of chapter 26.
Supporting OLAP and Data Mining
Other than supporting day-to-day decisions of a large number of operational
users, a DBMS also needs to support analysis and strategic decisions for
For example, a manager may analyze the performance of sales before and after a
marketing campaign, or analyze the performance of sales in order to forecast
product sales and plan accordingly for product ordering and storage capacity.
(Prepared by BR-Cheung) 13-1/15
Such on-line analytical processing (OLAP) can be enhanced with the use of
data mining tools.
These are tools that discover new patterns or rules that cannot necessarily
be found with mere query processing.
They utilize AI machine learning techniques that automatically classify
the data into different groups based on different criteria.
For example, it is possible from data on product sales to derive a rule that if a
customer shops on Sunday before 11 a.m. and buys milk, the customer buys a
newspaper and chocolate cookies as well. When a store manager wishes to
promote a particular brand of chocolate cookies, she can use the above rule and
stack the chocolate cookies next to the Sunday newspaper stand.
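A rule like the one above is usually judged by how often it holds in the data. The following is a minimal Python sketch, not a real data mining tool: it computes the support (fraction of all baskets containing every item of the rule) and the confidence (fraction of baskets with the "if" part that also contain the "then" part). The function name rule_stats and the baskets are made up for illustration.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain the "if" part of the rule.
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    # Baskets that contain both the "if" and the "then" part.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical Sunday-morning baskets (invented data, not real sales).
baskets = [
    {"milk", "newspaper", "chocolate cookies"},
    {"milk", "newspaper", "chocolate cookies", "bread"},
    {"milk", "bread"},
    {"newspaper"},
]
support, confidence = rule_stats(
    baskets, {"milk"}, {"newspaper", "chocolate cookies"}
)
```

With these four baskets the rule is supported by half of them, and two of the three milk buyers follow it.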
OLAP and data mining involve no modifications, and require ad hoc access to all
the organization's data, current as well as historical. This points to the need for
new data models for organization and storage of historical data—models that
optimize query processing rather than transaction processing.
What is a Data Warehouse?
A data warehouse integrates data from multiple data sources and organizes them
for efficient query processing and presentation.
The best definition of a data warehouse was given by Inmon (1992) when he
introduced the term: a data warehouse is a subject-oriented,
integrated, non-volatile, time-variant collection of data in support of
management's decisions.
• Subject-oriented collection means that data are organized around subjects
such as customers, products, sales. In databases, by contrast, the stored data
are organized around activities. For example, we use databases to store
detailed data about purchasing orders and product acquisition. We use a data
warehouse to store the summarization of the detailed data based on a subject.
A summarization can be produced, for example, by applying aggregate
functions on groups of rows that are specified using the GROUP BY clause.
For instance, a summarization around a product can be product sales,
SELECT Product, SUM(Total) AS ProductSale
FROM PurchaseOrders
GROUP BY Product;
and a summarization around a sale can be daily sales,
SELECT WorkDay, SUM(Total) AS DailySale
FROM PurchaseOrders
GROUP BY WorkDay;
An aggregation of daily sales yields the example of time series that we
discussed in 5.1.2.
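The two summarizations above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column types and the three sample rows are assumptions, and both queries are assumed to read from the PurchaseOrders table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout of the detailed (operational) data.
conn.execute(
    "CREATE TABLE PurchaseOrders (Product TEXT, WorkDay TEXT, Total REAL)"
)
conn.executemany(
    "INSERT INTO PurchaseOrders VALUES (?, ?, ?)",
    [
        ("Toy", "2024-01-01", 10.0),
        ("Toy", "2024-01-02", 5.0),
        ("Game", "2024-01-01", 20.0),
    ],
)

# Summarization around the subject "product".
product_sales = conn.execute(
    "SELECT Product, SUM(Total) AS ProductSale "
    "FROM PurchaseOrders GROUP BY Product ORDER BY Product"
).fetchall()

# Summarization around the subject "sale": one total per working day.
daily_sales = conn.execute(
    "SELECT WorkDay, SUM(Total) AS DailySale "
    "FROM PurchaseOrders GROUP BY WorkDay ORDER BY WorkDay"
).fetchall()
```

The ORDER BY clauses are added only to make the result order predictable. The list daily_sales is a small time series of the kind mentioned above.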
• Integrated collection means that a data warehouse integrates and stores data
from multiple sources, not all of which are necessarily databases. A data
source can be a certain application file. Note that a data warehouse is not an
example of a multi-database system that just provides access to data stored in
heterogeneous databases. A data warehouse actually stores the integrated data
after they are cleaned, removing any inconsistencies such as different formats
and erroneous values. In this way, users are presented with a consistent
unified view of the data in the data warehouse.
• Non-volatile collection means that the data warehouse is not updated in real
time and in coordination with the updates on the data sources. Updates on a
data source are grouped and applied together on the data warehouse by a
maintenance transaction. Maintenance transactions execute periodically or on
• Time-variant collection means that data in a data warehouse is historical data
whose values have temporal validity. This clearly shows that data
warehouses must support time series.
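The non-volatile property can be pictured as follows: updates from a data source are collected first and then applied to the warehouse together. The sketch below uses Python's sqlite3 module. The table DailySales, the list pending, and the function run_maintenance are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table holding one historical value per working day.
conn.execute("CREATE TABLE DailySales (WorkDay TEXT PRIMARY KEY, DailySale REAL)")

# Updates collected from a data source since the last maintenance run.
pending = [("2024-01-01", 30.0), ("2024-01-02", 5.0)]

def run_maintenance(conn, batch):
    """Apply all grouped updates to the warehouse in a single transaction."""
    # "with conn" commits if every statement succeeds and rolls back otherwise,
    # so the warehouse never shows a half-applied batch.
    with conn:
        conn.executemany("INSERT INTO DailySales VALUES (?, ?)", batch)
    batch.clear()

run_maintenance(conn, pending)
rows = conn.execute("SELECT COUNT(*) FROM DailySales").fetchone()[0]
```

Each row keeps its date, so the table is also a simple example of time-variant data.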
Data Warehouse Architecture
Multidimensional models populate data in multidimensional matrices. Three-
dimensional (3-d) matrices are called data cubes, and matrices with more than
three dimensions are called hypercubes.
As an example of a cube, consider the dimensions: fiscal period, product, region.
As we have mentioned earlier, we can use a 2-d matrix such as a spreadsheet, to
represent regional sales for a fixed period.
[Spreadsheet: sales for a fixed period, with one column per region (R1, R2, R3, ...)]
A spreadsheet can be converted into a cube by adding a time dimension to the
spreadsheet, for example, month intervals.
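The step from spreadsheet to cube can be sketched in Python. Here each spreadsheet is a nested dictionary (product rows, region columns), and the cube is a dictionary keyed by (month, product, region). The function name add_time_dimension and all figures are assumptions for illustration only.

```python
# One spreadsheet per fixed period: rows are products, columns are regions.
january = {"P1": {"R1": 10, "R2": 20}, "P2": {"R1": 5, "R2": 0}}
february = {"P1": {"R1": 7, "R2": 3}, "P2": {"R1": 1, "R2": 4}}

def add_time_dimension(sheets):
    """Stack per-month spreadsheets into a cube keyed by (month, product, region)."""
    cube = {}
    for month, sheet in sheets.items():
        for product, row in sheet.items():
            for region, sales in row.items():
                cube[(month, product, region)] = sales
    return cube

cube = add_time_dimension({"Jan": january, "Feb": february})
```

A single cell is then addressed by all three dimensions, for example cube[("Feb", "P1", "R2")].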
Data-Warehouse-Enhanced Query Operations
Data warehouses provide a multidimensional conceptual view with unlimited
dimensions and aggregation levels. They offer several operators to facilitate both
querying and visualizing data in a multidimensional view.
• Pivot or Rotation: Cubes can be visualized and reoriented along different
axes. In the example above, product and region are presented at the front.
Using rotation, we can bring time and product to the front, pushing region to
the back.
• Roll-Up Display: It can be used to derive a coarse-grained view, that is, a
summarization and grouping along a dimension. For example, months can
be grouped into years along the time axis, and products can be grouped into
categories such as toys and computer games.
• Drill-Down Display: It can be used to derive a finer-grain view, that is to
say, a disaggregation along a dimension. For example, regions can be
disaggregated to cities within each region, and months can be disaggregated
into weeks or days.
• Slice and Dice: It can be used to specify projection on dimensions, creating
smaller cubes. For example, retrieve all the toy products in cities in
Pennsylvania during the winter months.
• Selection: This is similar to the standard select in SQL that can be used to
retrieve data by value and range.
• Sorting: It can be used to specify the ordering of data along a dimension.
• Derived Attributes: It allows the specification of attributes that are computed
from stored and other derived values.
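Some of the operators listed above can be sketched on a cube stored as a Python dictionary keyed by (month, product, region). This is an illustrative sketch, not how a warehouse product implements them. The function names and the data are assumptions, and slice_cube simply restricts the cube to chosen values.

```python
from collections import defaultdict

# Hypothetical cube: (month, product, region) -> sales.
cube = {
    ("2024-01", "toy", "R1"): 6,
    ("2024-02", "toy", "R1"): 5,
    ("2024-01", "toy", "R2"): 8,
    ("2024-01", "game", "R1"): 4,
}

def roll_up_to_year(cube):
    """Roll-up: group months into years along the time axis."""
    out = defaultdict(int)
    for (month, product, region), sales in cube.items():
        out[(month[:4], product, region)] += sales
    return dict(out)

def slice_cube(cube, product=None, region=None):
    """Slice and dice: keep only the cells matching the given values."""
    return {
        key: sales
        for key, sales in cube.items()
        if (product is None or key[1] == product)
        and (region is None or key[2] == region)
    }

def pivot(cube, order):
    """Pivot: reorder the axes, e.g. order=(1, 2, 0) puts product first."""
    return {tuple(key[i] for i in order): sales for key, sales in cube.items()}

yearly = roll_up_to_year(cube)
toys = slice_cube(cube, product="toy")
rotated = pivot(cube, (1, 2, 0))
```

Drill-down is the reverse of roll_up_to_year: it needs the finer-grained cells, which is why the detailed data (or a finer summary) must still be available.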
Multidimensional Storage Model
Data warehouses support the summarization provided by the drill-down and
roll-up operations in one of two ways, both of which involve a trade-off
between time and space.
• They maintain smaller summary tables that are retrieved to display a
• They encode all different levels along a dimension (e.g., weekly, quarterly,
yearly) into existing tables. Using the appropriate encoding, a
summarization is computed from the detailed data when needed.
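The two alternatives can be compared side by side with Python's sqlite3 module. This is a sketch under assumptions: the tables Sales and QuarterlySummary, their columns, and the rows are invented. The stored summary table costs extra space but answers quickly, while the on-demand query uses the encoded Quarter column and costs time at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Second alternative: each detailed row encodes the levels of the time dimension.
conn.execute("CREATE TABLE Sales (Day TEXT, Quarter TEXT, Year INTEGER, Amount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [
        ("2024-01-15", "2024-Q1", 2024, 10.0),
        ("2024-02-20", "2024-Q1", 2024, 5.0),
        ("2024-05-01", "2024-Q2", 2024, 7.0),
    ],
)

# First alternative: a smaller summary table, computed once and stored.
conn.execute(
    "CREATE TABLE QuarterlySummary AS "
    "SELECT Quarter, SUM(Amount) AS Amount FROM Sales GROUP BY Quarter"
)

# Reading the stored summary.
stored = conn.execute(
    "SELECT Amount FROM QuarterlySummary WHERE Quarter = '2024-Q1'"
).fetchone()[0]

# Computing the same summarization from the detailed data when needed.
on_demand = conn.execute(
    "SELECT SUM(Amount) FROM Sales WHERE Quarter = '2024-Q1'"
).fetchone()[0]
```

Both paths give the same figure; they differ only in where the cost is paid.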
Tables in a data warehouse are logically organized in a so-called star schema.
A star schema consists of a central fact table containing the factual data that can
be analyzed in a variety of ways, and also a dimension table for each dimension,
containing reference data. Detailed data are stored in the dimension tables and
are referenced by foreign keys in the fact table.
For example, a star schema that can support the above example would consist of
a fact table, surrounded by three dimension tables—one for Products, one for
regional sales, and one for month intervals.
SALE_SUMMARY (Product, Month, Region, Sales);
  Product -> PRODUCT(PID)
  Month -> MONTH_INTERVAL(Month)
  Region -> REGIONAL_SALES(Region)
PRODUCT (PID, Pname, PCategory, PDescription);
REGIONAL_SALES (Region, County, City);
MONTH_INTERVAL (MonthNo, Month, Year);
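To see how the fact table and the dimension tables work together, the schema above can be built in SQLite from Python. This is a sketch with assumptions: the column types, the sample rows, the name SALE_SUMMARY for the fact table, and the choice to let the fact table reference the integer keys PID, MonthNo, and Region are all made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: reference data.
CREATE TABLE PRODUCT (PID INTEGER PRIMARY KEY, Pname TEXT,
                      PCategory TEXT, PDescription TEXT);
CREATE TABLE REGIONAL_SALES (Region INTEGER PRIMARY KEY, County TEXT, City TEXT);
CREATE TABLE MONTH_INTERVAL (MonthNo INTEGER PRIMARY KEY, Month TEXT, Year INTEGER);

-- Fact table: one row per (product, month, region) with the measured sales.
CREATE TABLE SALE_SUMMARY (
    Product INTEGER REFERENCES PRODUCT(PID),
    Month   INTEGER REFERENCES MONTH_INTERVAL(MonthNo),
    Region  INTEGER REFERENCES REGIONAL_SALES(Region),
    Sales   REAL
);

-- Invented sample rows.
INSERT INTO PRODUCT VALUES (1, 'Robot', 'toys', ''), (2, 'Chess', 'computer games', '');
INSERT INTO REGIONAL_SALES VALUES (1, 'Allegheny', 'Pittsburgh');
INSERT INTO MONTH_INTERVAL VALUES (1, 'January', 2024), (2, 'February', 2024);
INSERT INTO SALE_SUMMARY VALUES (1, 1, 1, 100.0), (1, 2, 1, 50.0), (2, 1, 1, 30.0);
""")

# Roll up sales to (category, year) by joining the fact table to two dimensions.
rows = conn.execute("""
    SELECT p.PCategory, m.Year, SUM(f.Sales)
    FROM SALE_SUMMARY f
    JOIN PRODUCT p ON f.Product = p.PID
    JOIN MONTH_INTERVAL m ON f.Month = m.MonthNo
    GROUP BY p.PCategory, m.Year
    ORDER BY p.PCategory
""").fetchall()
```

Every analysis query has the same shape: start from the fact table, join only the dimensions needed, and group by attributes of those dimensions.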
In the star schema, dimension tables might not be normalized, containing
redundant data (recall our discussion on normalization). The motivation for this
redundancy is to speed up query processing by eliminating the need for
joins.
On the other hand, an unnormalized table can grow very large and the overhead
of scanning the table may offset any gain in query processing. In such a case,
dimension tables can be normalized by decomposing them into smaller
dimension tables, and referencing them within the original dimension table. This
decomposition leads to a hierarchical "star" schema called the Snowflake
schema.
As in databases, data warehouses utilize different forms of indexing to speed up
access to data. Further, they implement efficient handling of dynamic sparse
Issues and Categories of Data Warehouses
Compared to databases, data warehouses are very costly to build both in terms of
time and money. Furthermore, they are very costly to maintain.
• Data warehouses have huge sizes and grow at enormous rates. They are at
least one order of magnitude larger than their data sources. They range
in size from hundreds of gigabytes to terabytes and even petabytes.
• Resolving semantic heterogeneity between data sources, and converting
different bodies of data from their data source models, to the data warehouse
model is a complex and time-consuming process. This process is not
executed only once, but is repeated almost every time that the data
warehouse is synchronized with its data sources. An example of a
heterogeneity that must be resolved each time is when two data sources are
using different currencies and follow different fiscal calendars.
• Cleaning data to ensure quality and validity is another complex process. In
fact, it has been identified as the most labor-intensive process of the data
warehouse construction. The reason is that recognizing incomplete or
erroneous data is difficult to automate, at least in the beginning. After being
recognized, if some erroneous or missing entries follow a specific pattern,
then these entries can be automatically corrected. For example, if a data
source uses -99 to represent the NULL value in a specific column, then all
-99 in that column can be converted into NULL. Another example: using the
zip code, one can correct the misspelling of a city. For instance, Pittsburgh,
PA ends with "h" whereas Pittsburg, CA does not.
• The decision of what data to summarize and how to organize it is another
very critical process. It affects both the utility of the data warehouse and its
• The sheer volume of data makes data loading and synchronization a
significant task. The data warehouse must be recovered from incomplete or
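The pattern-based corrections described in the data-cleaning item above (the -99 sentinel and the zip-code lookup) can be sketched in Python. The column names quantity, city, state, and zip, the function clean_row, and the small lookup table ZIP_TO_CITY are assumptions for illustration; a real warehouse would use a complete reference table.

```python
# Hypothetical reference table: zip code -> canonical (city, state).
ZIP_TO_CITY = {
    "15213": ("Pittsburgh", "PA"),
    "94565": ("Pittsburg", "CA"),
}

def clean_row(row, null_sentinel=-99):
    """Return a cleaned copy of one source row (a dict)."""
    cleaned = dict(row)
    # Pattern 1: the source encodes NULL as -99 in the quantity column.
    if cleaned.get("quantity") == null_sentinel:
        cleaned["quantity"] = None
    # Pattern 2: the zip code determines the correct spelling of the city.
    canonical = ZIP_TO_CITY.get(cleaned.get("zip"))
    if canonical is not None:
        cleaned["city"], cleaned["state"] = canonical
    return cleaned

rows = [
    {"quantity": -99, "city": "Pittsburg", "state": "PA", "zip": "15213"},
    {"quantity": 3, "city": "Pittsburg", "state": "CA", "zip": "94565"},
]
cleaned = [clean_row(r) for r in rows]
```

The first row has its quantity set to None and its city corrected to Pittsburgh; the second row is already correct and is left unchanged. Rows whose zip code is not in the lookup table are passed through as they are, since they cannot be corrected automatically.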
It is clear that administration and management tools are essential in such a
complex environment. It seems, nowadays, that data administration is shifting
away from database systems to data warehouses.
In order to reduce the severity of the above issues, two other alternatives to an
enterprise-wide data warehouse have been proposed:
• Data Marts: These are small, highly focussed data warehouses at the level of
a department. An enterprise-wide data warehouse can be constructed by
forming a federation of data marts.
• Virtual Data Warehouses: These are persistent collections of views of
operational databases, that are materialized for efficient access and complex
Transaction Scheduling and Concurrency.
In exam3, you might be asked to extend your E-Commerce project
implementation. Instead of extending your implementation, you could
extend the sample library project. Make sure you have the skills to build a
web-based database application. The best way to master the skills and be
prepared for exam3 is to complete your project.