I can’t find the data I need
◦ data is scattered over the network
◦ many versions, subtle differences
I can’t get the data I need
◦ need an expert to get the data
I can’t understand the data I found
◦ available data poorly documented
I can’t use the data I found
◦ results are unexpected
◦ data needs to be transformed from one form to another
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Used by managers and end-users to understand the business and make decisions
Since the 1970s, organizations have gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective services to the customer.
This resulted in the accumulation of growing amounts of data in operational databases.
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process, Inmon (1993).
The warehouse is organized around the major subjects of the enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and product sales).
This is reflected in the need to store decision-support data rather than application-oriented data.
The data warehouse integrates corporate
application-oriented data from different
source systems, which often includes data
that is inconsistent.
The integrated data source must be made
consistent to present a unified view of the
data to the users.
Data in the warehouse is only accurate and valid at some point in time or over some time interval.
Time-variance is also shown in the
extended time that the data is held, the
implicit or explicit association of time with
all data, and the fact that the data
represents a series of snapshots.
Data in the warehouse is not updated in real-
time but is refreshed from operational
systems on a regular basis.
New data is always added as a supplement to
the database, rather than a replacement.
Potential high returns on investment
Increased productivity of corporate decision-makers
The types of queries that a data warehouse is expected to answer range from the relatively simple to the highly complex, and depend on the type of end-user access tools used.
End-user access tools include:
◦ Reporting, query, and application development
◦ Executive information systems (EIS)
◦ OLAP tools
◦ Data mining tools
What was the total revenue for Scotland in the third quarter of 2004?
What was the total revenue for property sales for each type of
property in Great Britain in 2003?
What are the three most popular areas in each city for the renting
of property in 2004 and how does this compare with the figures
for the previous two years?
What is the monthly revenue for property sales at each branch
office, compared with rolling 12-monthly prior figures?
What would be the effect on property sales in the different
regions of Britain if legal costs went up by 3.5% and Government
taxes went down by 1.5% for properties over £100,000?
Which type of property sells for prices above the average selling
price for properties in the main cities of Great Britain and how
does this correlate to demographic data?
What is the relationship between the total annual revenue
generated by each branch office and the total number of sales
staff assigned to each branch office?
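As a rough sketch of how the last query above (total revenue per branch office) might be answered over warehouse fact rows — the branch codes and revenue figures below are invented for illustration, not from the source:

```python
from collections import defaultdict

# Fact rows: (branch_office, revenue); values are invented.
sales = [("B005", 85000), ("B005", 92000), ("B003", 110000),
         ("B003", 70000), ("B007", 64000)]

totals = defaultdict(int)
for branch, revenue in sales:
    totals[branch] += revenue      # roll revenue up to branch level

print(dict(totals))  # {'B005': 177000, 'B003': 180000, 'B007': 64000}
```

In practice such aggregations would be expressed in SQL or via an OLAP tool; the loop only shows the grouping logic.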
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
High demand for resources
Long duration projects
Complexity of integration
A subset of a data warehouse that supports
the requirements of a particular
department or business function.
◦ Focuses on only the requirements of one
department or business function.
◦ Does not normally contain detailed operational data, unlike a data warehouse.
◦ More easily understood and navigated.
To give users access to the data they need
to analyze most often.
To provide data in a form that matches the
collective view of the data by a group of
users in a department or business function
To improve end-user response time due to the reduction in the volume of data to be accessed.
To provide appropriately structured data as
dictated by the requirements of the end-
user access tools.
Building a data mart is simpler compared with establishing a corporate data warehouse.
The cost of implementing data marts is
normally less than that required to establish
a data warehouse.
The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.
Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Need interactive response to aggregate queries
Accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities.
There are two main types of access tools available to meet this demand, namely Online Analytical Processing (OLAP) and data mining.
OLAP and data mining differ in what they offer the user, and because of this they are complementary technologies.
An environment that includes a data warehouse (or more commonly one or more data marts) together with tools such as OLAP and/or data mining is collectively referred to as Business Intelligence (BI) technology.
The dynamic synthesis, analysis, and
consolidation of large volumes of multi-
dimensional data, Codd (1993).
Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis.
Enables users to gain a deeper
understanding and knowledge about
various aspects of their corporate data
through fast, consistent, interactive access
to a wide variety of possible views of the data.
Allows users to view corporate data in such
a way that it is a better model of the true
dimensionality of the enterprise.
Can easily answer ‘who?’ and ‘what?’
questions, however, ability to answer ‘what
if?’ and ‘why?’ type questions distinguishes
OLAP from general-purpose query tools.
Types of analysis range from basic
navigation and browsing (slicing and dicing)
to calculations, to more complex analyses
such as time series and complex modeling.
Although OLAP applications are found in
widely divergent functional areas, they all
have the following key features:
◦ multi-dimensional views of data
◦ support for complex calculations
◦ time intelligence
Must provide a range of powerful computational methods, such as those required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth.
Key feature of almost any analytical
application as performance is almost always
judged over time.
Time hierarchy is not always used in the
same manner as other hierarchies.
Concepts such as year-to-date and period-over-period comparisons should be easily defined.
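The two time-intelligence calculations just named can be sketched in a few lines of Python; the monthly revenue figures are invented for illustration:

```python
from itertools import accumulate

monthly_revenue = [100, 120, 90, 150]          # Jan..Apr, invented figures

# Year-to-date: running total over the months so far.
ytd = list(accumulate(monthly_revenue))        # [100, 220, 310, 460]

# Period-over-period: percentage change versus the previous month.
pop = [round((cur - prev) / prev * 100, 1)
       for prev, cur in zip(monthly_revenue, monthly_revenue[1:])]
print(pop)                                     # [20.0, -25.0, 66.7]
```

An OLAP tool exposes these as built-in time functions over the Time dimension; the lists only show the arithmetic involved.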
Increased productivity of end-users.
Reduced backlog of applications
development for IT staff.
Retention of organizational control over the
integrity of corporate data.
Reduced query drag and network traffic on
OLTP systems or on the data warehouse.
Improved potential revenue and profitability.
Example of two-dimensional query.
◦ ‘What is the total revenue generated by property sales in each city, in each quarter of 2004?’
Choice of representation is based on the types of queries the end-user may ask.
Compare representations: a three-field relational table versus a two-dimensional matrix.
Example of three-dimensional query.
◦ ‘What is the total revenue generated by property
sales for each type of property (Flat or House) in
each city, in each quarter of 2004?’
Compare representations: a four-field relational table versus a three-dimensional cube.
Cube represents data as cells in an array.
Relational table only represents multi-
dimensional data in two dimensions.
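The representation comparison can be made concrete with a small sketch: the same data held as a three-field relational table and then pivoted into a two-dimensional city × quarter matrix (city names and revenue figures are invented for illustration):

```python
# Three-field relational table: (city, quarter, revenue) — invented figures.
rows = [("Aberdeen", "Q1", 29726), ("Aberdeen", "Q2", 30443),
        ("Glasgow",  "Q1", 66854), ("Glasgow",  "Q2", 69925)]

# Pivot into a two-dimensional matrix: matrix[city][quarter] -> revenue.
matrix = {}
for city, quarter, revenue in rows:
    matrix.setdefault(city, {})[quarter] = revenue

print(matrix["Glasgow"]["Q2"])   # 69925
```

The matrix form answers "revenue for a given city and quarter" with a single cell lookup, which is the access pattern OLAP optimizes for.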
Measure - sales (actual, plan, variance)
Dimensions: Product, Region, Time
Hierarchical summarization paths:
Product: Industry → Category → Product
Region: Country → Region → City
Time: Year → Quarter → Month / Week
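The summarization paths above imply roll-up queries along each hierarchy; here is a sketch of one roll-up step (City → Region), with an invented city-to-region mapping and invented sales figures:

```python
from collections import defaultdict

# Invented sales per city and an invented City -> Region mapping.
city_sales  = {"Glasgow": 120, "Edinburgh": 80, "London": 200}
city_region = {"Glasgow": "Scotland", "Edinburgh": "Scotland",
               "London": "England"}

region_sales = defaultdict(int)
for city, amount in city_sales.items():
    region_sales[city_region[city]] += amount   # roll up one level

print(dict(region_sales))  # {'Scotland': 200, 'England': 200}
```

Rolling up Region → Country or Month → Quarter follows the same pattern, one mapping per hierarchy level.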
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools:
Andyne Computing --
Arbor Software -- Essbase
Cognos -- PowerPlay
Comshare -- Commander
Holistic Systems -- Holos
Information Advantage --
Informix -- Metacube
Oracle -- Express
Pilot -- LightShip
Planning Sciences --
Platinum Technology -- ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS,
Speedware -- Media
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions, Simoudis (1996).
Involves the analysis of data and the use of
software techniques for finding hidden and
unexpected patterns and relationships in
sets of data.
Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already known.
Patterns and relationships are identified by
examining the underlying rules and features
in the data.
The most accurate results normally require large volumes of data to be analyzed.
Mining starts by developing an optimal representation of the structure of sample data, during which time knowledge is acquired and later extended to larger sets of data.
Data mining can provide huge paybacks for
companies who have made a significant
investment in data warehousing.
A relatively new technology, but one already used in a number of industries.
Retail / Marketing
◦ Identifying buying patterns of customers
◦ Finding associations among customer demographic characteristics
◦ Predicting response to mailing campaigns
◦ Market basket analysis
Banking / Finance
◦ Detecting patterns of fraudulent credit card use
◦ Identifying loyal customers
◦ Predicting customers likely to change their credit card affiliation
◦ Determining credit card spending by customer groups
Insurance and Health Care
◦ Claims analysis
◦ Predicting which customers will buy new policies
◦ Characterizing patient behavior to predict surgery visits
◦ Identifying successful medical therapies for different illnesses
Four main operations include:
◦ Predictive modeling
◦ Database segmentation
◦ Link analysis
◦ Deviation detection
There are recognized associations between the applications and the corresponding operations.
◦ e.g. Direct marketing strategies use database segmentation.
Techniques are specific implementations of
the data mining operations.
Each operation has its own strengths and weaknesses.
Similar to the human learning experience
◦ uses observations to form a model of the
important characteristics of some phenomenon.
Uses generalizations of the ‘real world’ and the ability to fit new data into a general framework.
Can analyze a database to determine the essential characteristics (model) of the data.
Model is developed using a supervised
learning approach, which has two phases:
training and testing.
◦ Training builds a model using a large sample of
historical data called a training set.
◦ Testing involves trying out the model on new,
previously unseen data to determine its accuracy
and physical performance characteristics.
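The two-phase training/testing approach can be sketched as below; the records, labels, and the deliberately trivial majority-class "model" are illustrative assumptions, not a method from the source:

```python
import random
from collections import Counter

random.seed(0)  # reproducible shuffle

# Historical records: (customer_id, label); labels are invented.
records = [(i, "stay" if i % 3 else "churn") for i in range(30)]
random.shuffle(records)

# Phase 1 (training) uses 20 records; phase 2 (testing) holds out 10.
train, test = records[:20], records[20:]

# A deliberately trivial "model": predict the majority training label.
majority = Counter(label for _, label in train).most_common(1)[0][0]

# Accuracy is measured only on previously unseen records.
accuracy = sum(label == majority for _, label in test) / len(test)
```

Any real classifier would replace the majority-label step; the point is that the model is built on one sample and judged on another.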
Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing.
There are two techniques associated with
predictive modeling: classification and value
prediction, which are distinguished by the
nature of the variable being predicted.
Used to estimate a continuous numeric value that is associated with a database record.
Uses the traditional statistical techniques of
linear regression and nonlinear regression.
Relatively easy to use and understand.
Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.
The problem is that the technique only works well with linear data and is sensitive to the presence of outliers (that is, data values which do not conform to the expected norm).
Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.
Applications of value prediction include credit card fraud detection or target mailing list identification.
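A minimal value-prediction sketch: fit the straight line y = a + b·x by ordinary least squares and predict a continuous value for a new record. The (x, y) points are invented for illustration:

```python
# Invented data: x might be floor area, y a selling price.
xs = [1, 2, 3, 4, 5]
ys = [110, 205, 290, 410, 495]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Ordinary least squares slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)     # slope = 97.5
a = mean_y - b * mean_x                      # intercept = 9.5

predicted = a + b * 6                        # value for an unseen x = 6 -> 594.5
```

As the surrounding text notes, this only works well when the data really is linear; a single extreme outlier in `ys` would drag the fitted line away from the bulk of the observations.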
Aim is to partition a database into an unknown number of segments, or clusters, of similar records.
Uses unsupervised learning to discover
homogeneous sub-populations in a database
to improve the accuracy of the profiles.
Less precise than other operations and thus less sensitive to redundant and irrelevant features.
Applications of database segmentation
include customer profiling, direct
marketing, and cross selling.
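Database segmentation can be sketched with a tiny k-means-style clustering loop over one numeric attribute; the spend values and the two starting centroids are invented for illustration (real segmentation tools cluster many attributes at once):

```python
# Invented annual-spend values and two starting centroids (k = 2).
spend = [10, 12, 11, 90, 95, 88]
centroids = [10.0, 90.0]

for _ in range(10):                       # a few refinement passes
    clusters = [[], []]
    for v in spend:
        nearest = min((0, 1), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)       # assign to nearest centroid
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)   # [11.0, 91.0]
```

This is unsupervised: no record carries a label, yet the loop discovers the low-spend and high-spend sub-populations on its own.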
Aims to establish links (associations) between records, or sets of records, in a database.
There are three specializations:
◦ Associations discovery
◦ Sequential pattern discovery
◦ Similar time sequence discovery
Applications include product affinity analysis, direct marketing, and stock price movement.
Finds items that imply the presence of other
items in the same event.
Affinities between items are represented by association rules.
◦ e.g. ‘When a customer rents property for more
than 2 years and is more than 25 years old, in
40% of cases, the customer will buy a property.
This association happens in 35% of all customers
who rent properties’.
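The confidence and support percentages quoted in such rules come from simple counts over the records; a sketch with five invented baskets and the rule bread → butter:

```python
# Five invented transactions (baskets of items).
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"butter"}, {"bread", "butter", "jam"}]

# Rule under test: bread -> butter.
antecedent, consequent = {"bread"}, {"butter"}
with_a  = [b for b in baskets if antecedent <= b]     # baskets with bread
with_ac = [b for b in with_a if consequent <= b]      # ... and butter too

support    = len(with_ac) / len(baskets)   # rule holds in 3/5 of baskets
confidence = len(with_ac) / len(with_a)    # 3/4 of bread baskets add butter
```

In the property rule above, 40% is the confidence (of those who rent for over 2 years and are over 25, 40% buy) and 35% is the support over all renters.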
Finds patterns between events such that the
presence of one set of items is followed by
another set of items in a database of events
over a period of time.
◦ e.g. Used to understand long-term customer buying behavior.
Finds links between two sets of data that are
time-dependent, and is based on the degree
of similarity between the patterns that both
time series demonstrate.
◦ e.g. Within three months of buying property, new
home owners will purchase goods such as cookers,
freezers, and washing machines.
Relatively new operation in terms of
commercially available data mining tools.
Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation or norm.
Can be performed using statistics and
visualization techniques or as a by-product
of data mining.
Applications include fraud detection in the
use of credit cards and insurance claims,
quality control, and defects tracing.
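A sketch of deviation detection by a basic statistical test: flag any value more than two standard deviations from the mean. The claim amounts are invented for illustration:

```python
from statistics import mean, stdev

claims = [120, 130, 125, 128, 122, 131, 950]   # one suspicious claim

mu, sigma = mean(claims), stdev(claims)
outliers = [c for c in claims if abs(c - mu) > 2 * sigma]

print(outliers)   # [950]
```

A fraud-detection system would use far richer features, but the principle is the same: the interesting records are the ones that deviate from the established norm.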
What is Big Data?
What makes data, “Big” Data?
No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
Various formats, types, and structures
◦ Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays
Static data vs. streaming data
A single application can be generating/collecting many types of data
To extract knowledge, all these types of data need to be linked together
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missing opportunities
◦ E-Promotions: Based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
◦ Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurements require immediate reaction
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
The Model of Generating/Consuming Data has Changed
Old model: few companies are generating data, all others are consuming data
New model: all of us are generating data, and all of us are consuming data
Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time nature
Big data is more real-time in nature than traditional DW applications
Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
The Bottleneck is in technology
◦ New architecture, algorithms, techniques are needed
Also in technical skills
◦ Experts in using the new technology and dealing
with big data