[Figure: Benefit of BI — Reduce Action Time. Added value is lost between the operational transaction and the implemented action, across data access time, analysis time, decision time and implementation time. Source: Hackathorn, R.: Minimizing Action Distance (2003), http://www.tdan.com/view-articles/5132/]
Chapter 16: Business intelligence
Paul Mireault (HEC Montréal)
This chapter introduces the student to the topic of Business Intelligence (BI). BI comprises a set of tools whose purpose is to help decision makers make better decisions by presenting them with the pertinent data and allowing them to analyse it. To use BI properly, business users must understand the infrastructure set up by IT specialists.
16.1 Introduction
“Intelligence” is information gathered by a government or other institution to guide decisions and actions (Boyer, 2001). While many associate this definition with espionage, the act of covertly gathering such information, its use in the business world does not have the cloak-and-dagger connotation that is the fodder of espionage novels and movies. But its fundamental objective is the same: guiding decisions and actions.
Business Intelligence (BI) is a set of processes (data gathering,
data analysis), technologies (data warehouse), and presentation
tools (report generator, dashboard) used by organizations to
analyse data (either internal or external) in order to gain new
insight on their environment (customers, suppliers) and make
better decisions.
O2, a mobile phone company, uses BI to reduce its churn rate (i.e. clients leaving it for a competitor). It estimates that its churn rate of 30% costs over €270m per year; a reduction of 1% would then save €9m per year. Churn analysis identifies the characteristics and behaviour of clients on the threshold of leaving, and informs client managers ahead of time so they can offer incentives for remaining with the company. (ComputerWeekly.com, 2010)
Hallmark, a greeting card manufacturer, uses data from its
13 million loyalty program customers to figure out how to
engage customers all year, not only during holidays and special
occasions. (Computerworld.com, 2011)
The FBI is using BI to identify fraudulent housing transactions,
and to look for patterns suggesting the presence of identity
theft rings in given areas. (Computerworld.com, 2007)
Those are but a few examples of companies and organizations
using Business Intelligence to improve their profitability or
become more efficient.
Originally, BI was restricted to advanced analytical tools used
by analysts with degrees in data analysis. But now, BI’s
purview
encompasses information presentation tools as well, like
management reports and dashboards.
Analytical tools seek new knowledge and insight related to the
data amassed through the organization’s daily operations, with
the expectation that this new knowledge will help managers
think of new products, policies, or processes that benefit the
organization.
Presentation tools are designed to help managers make
decisions related to the current state of their operations. They
present key business indicators and compare them with either
previous values, showing trends and direction of changes, or
with similar indicators in other categories, showing distributions.
In the context of an ERP, presentation and analytical tools
present the information using the same naming conventions
that the user sees in his normal interactions with the ERP. Tools
closely linked to the inner workings of the ERP offer this
desired
characteristic thanks to their access to the ERP’s data dictionary
and standardized data structures.
But for users to be able to perform significant analyses, some important preparation work has to be done by IT specialists. They need to set up a proper working environment, called a Data Warehouse, which will not interfere with normal daily operations and, at the same time, will make it easy for users to manipulate data without being bothered by technical details.
This behind-the-scenes work is crucial to the success of a BI
implementation. We will explain concepts related to the design
of the Data Warehouse as well as its operation.
The first part of the chapter will present the business user’s
view
of a BI environment, and the second part will present the behind
the scenes work that has to be done by the technical developers.
Mireault – Business intelligenceCHAPTER 16
Readings on Enterprise Resource Planning
Preliminary Version - send comments to [email protected] 209
16.2 Using Business Intelligence
Business Intelligence tools fall into two broad categories. The first category consists of analytical tools that use a wide range of mathematical methods to analyse data and, hopefully, discover new information. The second category consists of presentation tools that present data, basic or aggregated, in a visual format that lets users manipulate it themselves by changing the way it is presented, hopefully leading to new insight.
Analytical tools are usually designed for data analysts who have been schooled in advanced mathematical techniques. Two major programs are SAS Data Miner and SPSS Clementine.
On the other hand, presentation tools are designed for the more occasional user: managers and directors who don't use them all day. They can be as simple as pivot tables in a spreadsheet. They can also be easy to interact with but more complicated to develop, like dashboards.
16.2.1 Analytical Tools
16.2.1.1 Process
The analyst must understand the origin of the data he analyses. That may seem obvious, but different data sources may use
homonyms to represent slightly different concepts, which could
then affect the analysis itself. For example, in one business unit
the term Client may refer to somebody who has made at least
one purchase, but in another business unit the term Client may
include prospective clients. Using the same term for slightly
different concepts will lead to problems when we analyse data
from the two business units.
Units of measure are a common cause of incompatible data. For example, Company A buys Company B and merges its historical data with its own in the data warehouse. But Company A counts the boxes sold in its QTY SOLD field, and Company B counts the number of items in the boxes in its QTY SOLD field. The result would be much higher quantities in Company B's data.
While it may be tempting to use all the available data to build a model, Berry & Linoff (2004) recommend partitioning the available data into three sets: a training set to develop the model, a validation set “to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set”, and a test set to evaluate the model's performance with data unused during development. Analysis packages, like the aforementioned SAS Data Miner and SPSS Clementine, will easily partition the whole data set into the recommended sets.
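The three-way partition can be sketched in a few lines of Python. The 60/20/20 proportions, the `partition` helper and the fixed seed are illustrative choices, not something prescribed by the chapter:

```python
import random

def partition(data, train=0.6, valid=0.2, seed=42):
    """Randomly partition a data set into training, validation and test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)          # reproducible shuffle
    n = len(rows)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (rows[:n_train],                    # used to fit the model
            rows[n_train:n_train + n_valid],   # used to generalize it
            rows[n_train + n_valid:])          # held out for final evaluation

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 60 20 20
```

The shuffle matters: taking the first 60% of a file sorted by date would bias the training set.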
16.2.1.2 Techniques
There are many techniques to analyse data, too numerous to list
here. And new ones are invented regularly. We will just present
a few of them.
There are two categories of analysis. One group of techniques aims to analyse data and find some sort of structure to gather understanding. This group uses traditional statistical analysis tools, like hypothesis tests and confidence intervals, as well as non-statistical methods like ABC analysis.
The other group aims to predict events. Traditional linear regression is an example many readers already know.
Regression Analysis defines a dependent variable as a function of one or more independent variables. This technique can only be used with quantitative variables measured on a scale. Linear regression is taught in most undergraduate programs and is useful when the relationship between the dependent variable and each independent variable is linear. When the relationship between the dependent variable and some independent variable is not linear, the data analyst can use more advanced non-linear regression techniques.
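For one independent variable, ordinary least squares reduces to two closed-form expressions. The sketch below fits y = a + b·x; the data (advertising spend versus sales) and the `linear_regression` helper are made up for the example:

```python
from statistics import mean

def linear_regression(x, y):
    """Ordinary least squares for y = a + b*x (one independent variable)."""
    mx, my = mean(x), mean(y)
    # slope: covariance of x and y divided by variance of x
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx          # intercept: the line passes through the means
    return a, b

x = [1, 2, 3, 4, 5]          # hypothetical advertising spend
y = [52, 54, 57, 60, 62]     # hypothetical sales
a, b = linear_regression(x, y)
print(round(a, 2), round(b, 2))  # 49.2 2.6
```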
ABC Analysis is a technique that classifies items according to their relative importance and produces three groups. It is based on the Pareto Principle, also called the 80-20 rule, which is commonly used in business: 80% of your revenues come from 20% of your clients, and 80% of your sales are made by 20% of your products. The actual values 80 and 20 must be taken lightly. In the ABC Analysis, we divide our data into three groups. Group A contains the smallest number of items representing a total of about 80% of the measured value. Group B contains the smallest number of the remaining items representing a total of about 15% of the measured value. Group C contains the leftover items.
Table 1 - Sales Data for ABC Analysis

City      Sales
Berlin     1517
Boston    72099
Madrid    34302
Montreal  10328
New York   1915
Paris      8974
Toronto    1284
For example, consider the sales data shown in Table 1. We then calculate the sales percentages and order the data in decreasing sales order, as shown in Table 2.
Table 2 - Sales Data in Decreasing Sales Order

City      Sales   Pct Sales  Cum Pct Sales  Group
Boston    72099   55.28%      55.28%        A
Madrid    34302   26.30%      81.58%        A
Montreal  10328    7.92%      89.50%        B
Paris      8974    6.88%      96.38%        B
New York   1915    1.47%      97.85%        C
Berlin     1517    1.16%      99.02%        C
Toronto    1284    0.98%     100.00%        C
We can then assign Boston and Madrid to Group A, Montreal and Paris to Group B, and New York, Berlin and Toronto to Group C.
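The grouping rule can be sketched directly on the sales data of Table 1. The `abc_groups` helper and the 80%/95% cumulative cut-offs are illustrative choices (an item that crosses a threshold stays in the group it crosses, which reproduces the assignments of Table 2):

```python
sales = {"Berlin": 1517, "Boston": 72099, "Madrid": 34302, "Montreal": 10328,
         "New York": 1915, "Paris": 8974, "Toronto": 1284}   # Table 1

def abc_groups(values, a_cut=0.80, b_cut=0.95):
    """Classify items into A/B/C groups by cumulative share of the total."""
    total = sum(values.values())
    groups, cum = {}, 0.0
    for name, v in sorted(values.items(), key=lambda kv: -kv[1]):
        # the item that crosses a threshold still belongs to the group it crosses
        groups[name] = "A" if cum < a_cut else "B" if cum < b_cut else "C"
        cum += v / total
    return groups

print(abc_groups(sales))
```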
Associative Analysis refers to techniques that try to find associations in the data. A popular associative analysis technique is Market Basket Analysis. “Market basket analysis provides insight into the merchandise by telling us which products tend to be purchased together and which are more amenable to promotions.” (Berry & Linoff, 2004, p. 287)
Knowing which products are often bought together, like a clothes washer and dryer, or in sequence, like a flight, a car rental and a hotel room, can be valuable information for marketing purposes.
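The core of a market basket analysis is counting how often products co-occur in the same transaction. A minimal sketch follows; the `baskets` data and product names are invented for the example:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales tickets; each inner list is one market basket
baskets = [["washer", "dryer", "soap"],
           ["washer", "dryer"],
           ["soap", "towels"],
           ["washer", "dryer", "towels"]]

pair_counts = Counter()
for basket in baskets:
    # count every unordered pair of distinct products in the basket
    pair_counts.update(combinations(sorted(set(basket)), 2))

print(pair_counts.most_common(1))  # the pair most often bought together
```

Real implementations add support and confidence measures on top of such counts, but the co-occurrence table is the starting point.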
A Decision Tree is a classification model that divides a population into small homogeneous groups with respect to a categorical variable.
A simple categorical variable may be Membership Renewed, which may take the values Yes or No. A decision tree could be built using the clients' demographic data and all the service calls and complaints to determine the probability that an individual client may not renew his subscription. Clients who are identified as having the highest probability of non-renewal may then be targeted with special incentives to reduce that probability.
The major publishers of data analysis software are SAS and SPSS, now a division of IBM.
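A full decision tree chooses splits automatically, but a single split already shows the idea: partition the clients on one attribute and estimate the non-renewal probability in each leaf. The client records, the complaint threshold and the helper name below are all hypothetical:

```python
from collections import defaultdict

# Hypothetical client records: (number of complaints, membership renewed?)
clients = [(0, True), (0, True), (1, True), (1, False),
           (3, False), (4, False), (0, True), (2, False)]

def non_renewal_rate_by_split(records, threshold):
    """One decision-tree split: divide clients on a complaint threshold
    and estimate the probability of non-renewal in each leaf."""
    leaves = defaultdict(list)
    for complaints, renewed in records:
        leaves[complaints >= threshold].append(renewed)
    rates = {}
    for above, renewals in leaves.items():
        label = f"complaints >= {threshold}" if above else f"complaints < {threshold}"
        rates[label] = sum(1 for r in renewals if not r) / len(renewals)
    return rates

print(non_renewal_rate_by_split(clients, 2))
```

A tree-building algorithm would repeat this on each leaf, each time picking the attribute and threshold that make the leaves most homogeneous.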
16.2.1.3 Outliers
Outliers are cases whose values are unnaturally far from the majority of the other cases. While it is not unusual to have values in the extremities of a distribution, there are situations where some abnormally extreme values occur and can affect the data analysis we want to perform.
For example, a small jewellery store may have average daily sales of 10 000$, with daily sales extremes of 50 000$ happening three or four times a year. This week, it offers a 100$ rebate on all purchases over 1 000$. A billionaire comes in and buys 1 000 000$ worth of rings and necklaces. Now, an analysis of the rebate's effectiveness may show a great impact. But is it realistic? First of all, we may doubt that the billionaire was attracted by the 100$ rebate. And if the actual sales that week were 1 020 000$, most people would conclude that the rebate did not have any impact.
For this reason, many data analysts start their analysis by looking for outliers and removing them from their data sets. An easy way to identify data points that can be considered outliers is to use the mean and the standard deviation (values computed by any statistical analysis program, measuring, respectively, the central point and the dispersion of the data set). Values that are within 3 standard deviations of the mean are usually the result of normal randomness. Values that are outside that range should be examined more closely. You should remove only values for which you can give a proper explanation, like in the example above. Data analysts can use more advanced outlier detection techniques (Grubbs, 1969; Rousseeuw & Leroy, 1996).
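The 3-standard-deviation rule can be sketched as follows. The `find_outliers` helper and the sales figures are illustrative, and the sketch uses the population standard deviation; with a more advanced test the flagged set could differ:

```python
from statistics import mean, pstdev

def find_outliers(values, k=3):
    """Flag values more than k standard deviations away from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical daily sales, with one extreme day (the billionaire's visit)
daily_sales = [10_000] * 30 + [12_000] * 5 + [1_000_000]
print(find_outliers(daily_sales))  # [1000000]
```

Note that the extreme value itself inflates the standard deviation, which is one reason analysts turn to the more robust techniques cited above.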
16.2.2 BI Presentation Tools
Managers often need to visualize data in summary representations like tables and charts to help them answer questions or identify problems in their area. They formulate their requests for information in ways that are very similar.
Here are a few typical requests:
• I want to see sales totals by month.
• I want to see total production per product per month.
• I want to see the monthly average number of defects per
employee.
• I want the year-to-date sales by week, for this year and the
similar periods last year.
Those requests have an intrinsic structure that becomes apparent when you analyse them. First, there is the thing the manager needs to see: sales amount, production quantity, and the number of defects. We call those things measures; they are usually numerical data aggregated in some way: total, average, and running total.
Secondly, there are the grouping terms: month, employee, and week. We call them attributes; they can be quantitative or qualitative data. They are usually recognized by the use of the words by and per.
Figure 1 – Multidimensional Data Model
Attributes that pertain to a single concept are grouped together to form a dimension. Thus, the attributes Year, Month, Date, and Weekday form a Time dimension. Attributes that form a logical sequence from the broadest, with a small number of values, to the most specific, with the largest number of values, form a hierarchy.
Measures, dimensions and hierarchies are illustrated in a Multidimensional Data Model (Golfarelli & Rizzi, 2009), as shown in Figure 1. This example indicates that we are interested in three measures related to sales: Quantity Sold, Price, and Number of Orders. We have three dimensions (Time, Client and Product) with corresponding hierarchies, and we also have two attributes (Weekday and Brand) that are not part of any hierarchy but are of interest to the data analysts. There are natural hierarchies, like Time, but you can create artificial hierarchies to suit your needs. An American retailer could create a hierarchy Region, State and City, where Region can have the value Northeast or Midwest. An international retailer could have a hierarchy Region, Country and City, where Region can have the value Europe or South Pacific. In these examples, the Region dimension has different meanings.
Time is a special case that needs careful attention. In our everyday activities we use homonyms to describe time concepts that are different. Consider the following expressions:
• What day is it today? Oh! It's Tuesday.
• I'm going on vacation in three days.
• I want the total sales per day from January 10, 2011 to January 14, 2011.
In the first case, day refers to a generic day of the week. In the second case, day refers to a period of time. And in the last case, day refers to specific dates. The time hierarchy deals with specific dates.
16.2.2.1 Queries
Queries are the basis of all data extractions. A query is a request to extract data from a database system. Queries need to specify many details about the information that the user wants: which data elements (called fields), where they are in the database (in which tables), what conditions they must satisfy (called criteria, which can be used to specify a date range, for example), what calculations need to be performed (for example, multiplying price by quantity to get an amount), and what groupings are wanted (for example, group by product and calculate the sum of the amounts sold).
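The query components just listed (fields, tables, criteria, calculations, groupings) can be tried out with Python's built-in sqlite3 module. The `sales` table and its rows are invented for the example:

```python
import sqlite3

# In-memory database with a hypothetical sales table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, qty INTEGER, price REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("Gizmo", 2, 10.0), ("Widget", 1, 25.0), ("Gizmo", 3, 10.0)])

# Fields, a calculation (qty * price) and a grouping, as described above
for row in con.execute("""SELECT product, SUM(qty * price) AS amount
                          FROM sales
                          GROUP BY product
                          ORDER BY amount DESC"""):
    print(row)
```

Adding a WHERE clause to the SELECT would supply the criteria component, for example a date range on an order-date field.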
Queries are used in dashboards to obtain data that will be used to calculate indicators (see section 16.2.2.2). They are used in reports to provide printed data. They are also used to construct the fact table that is the base of a data cube (see section 16.2.2.4).
With many data analysis programs, queries can be specified with a point-and-click method, where the analyst selects data tables and data fields, can indicate selection criteria, and can perform some basic mathematical calculations. For example, you could ask for the total sales per salesperson per month. Figure 2 shows how this query is written with Microsoft Access, and Figure 3 shows its result.
Figure 2 - Simple Query with Graphical Interface
Figure 3 - Result of the Simple Query
While simple to use, this approach usually cannot perform sophisticated extractions. It would be hard to get the top selling product per month, which is a simple request in appearance (see Figure 4).
If an analyst needs to extract data using complex criteria, he usually has access to SQL (Structured Query Language), a language used to write queries. While programmers use SQL to develop operational database information systems, it can also be used to extract data from a data warehouse. The SQL query used to produce the result shown in Figure 4 is shown in Figure 5. Explaining the structure of SQL is beyond the scope of this chapter; the interested reader can consult (Pratt & Last, 2009) for an introduction and (Celko, 2011) for advanced SQL programming.
SELECT to_char (o1.order_date, 'YYYY-MM') AS month,
       product_id,
       SUM (qty * price) AS sales
FROM order o1
     JOIN order_line USING (order_id)
GROUP BY to_char (order_date, 'YYYY-MM'), product_id
HAVING SUM (qty * price)
    >= ALL (SELECT SUM (qty * price)
            FROM order o2
                 JOIN order_line USING (order_id)
            WHERE to_char (o2.order_date, 'YYYY-MM') = to_char (o1.order_date, 'YYYY-MM')
            GROUP BY product_id)
ORDER BY month
Figure 5 - Complex Query
16.2.2.2 Dashboard
Dashboards are information presentation tools designed to help decision makers see at a glance all the pertinent information related to their domain. Like a car dashboard, with which we are familiar, a management dashboard will show key indicators managers can use to evaluate the state of their business unit. For a car, the usual key indicators are speed, headlight status, oil pressure, remaining fuel, turn signal activation, etc. In a management context, many key indicators come from financial statements, like net profit, but the manager may also want indicators specific to his needs. For example, a purchasing manager may want to see the delivery delay for each supplier.
Like a car dashboard, a management dashboard may represent many indicators in a compact format. A well designed dashboard will present information in a way that is most appropriate for the decision maker: line charts, pie charts, bar charts, tabular data with colour coded indicators. Geographical information may be superimposed on a map, showing it in a way that tables and charts can't do justice (see Figure 6). A dashboard can also be interactive and present data according to where the user clicks, as shown in Figure 7.
Figure 6 - Representation of Data on a Map
Figure 7 - Dashboard with Dynamic Charts
Dashboard development tools are offered by the major
data warehouse management systems. SAP offers SAP
BusinessObjects Xcelsius Enterprise, IBM has Cognos Business
Intelligence and Oracle has Analytics Workspace Manager.
Dashboard design is a balancing act between simplicity and complexity. On one hand, we want to present information that is simple to understand; on the other hand, the underlying information is complex.
The data elements shown in a dashboard are extracted from the data warehouse with queries. There are situations where data elements may come from the operational database, but those are mostly limited to dashboards used by operational managers. In such situations, care must be taken that the underlying queries do not adversely affect the operations themselves: a complex query may slow down the operational system.
Alerts are associated with indicators. The dashboard designer must set up threshold levels for all the desired alert warnings. For example, a low cash reserve may be accompanied by a red light, a medium-sized cash reserve by a yellow light, and a high cash level by a green light. Figure 6 illustrates states with low revenues in red.
Some warnings may be associated with both low and high values: too much stock in inventory may need the same attention from the warehouse manager as too little stock.
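The threshold logic behind such alerts can be sketched in a few lines. The `alert_colour` helper and the cash thresholds are illustrative:

```python
def alert_colour(value, low, high):
    """Map an indicator value to a traffic-light alert given two thresholds."""
    if value < low:
        return "red"      # e.g. a low cash reserve needs immediate attention
    if value < high:
        return "yellow"
    return "green"

print(alert_colour(40_000, low=50_000, high=150_000))   # red
print(alert_colour(200_000, low=50_000, high=150_000))  # green
```

A two-sided indicator like stock level would use two such bands, flagging both the too-low and too-high ranges.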
16.2.2.3 Reports
Managers have used reports since the beginning of computer-
ized information systems. Until recently, reports were printed
and visually examined by managers to see if anything needed
attention. Modern reports are now produced as files that can
be analyzed further by the manager using a spreadsheet like
Microsoft Excel.
Reports are designed, usually by systems analysts, according to
the information needed by different users.
The simplest report is an extraction of the basic data from the database, like the list of all orders received yesterday, or the list of products in the warehouse with their current stock.
Reports can be produced at regular intervals, according to the user's needs. The inventory manager may want to see the stock level report every day, and the VP of Sales may want to see the regional sales report every month.
Occasional or alert reports are produced only when a predefined situation occurs. Thus, the Production VP may receive a special report when more than 5% of raw materials have a level corresponding to less than 3 days of production. It could indicate that production may cease if those raw materials are not received soon. An alert report is, by itself, important information that can lead to action. The design of the report includes not only the information content but also the event definition, which must balance the reduction of false alerts against the reaction time needed to act on the problem.
Figure 8 - MDM of Cube Data
Reports often combine detailed information and different levels
of aggregation. For example, the inventory report containing the
list of every product in the warehouse, along with its quantity
and its value, may group products by product type or by brand,
and compute a subtotal value by product type or brand.
Such reports are useful for a visual examination of their content. But their structure needs to be defined in advance, and any variation needs to be programmed by the IS specialists.
The major reporting packages are SAP Crystal Reports and IBM Cognos Business Intelligence Query and Reporting.
16.2.2.4 Data Cubes and Pivot Tables
Pivot tables are the logical evolution of reports, with the major
improvement being that the user can easily redefine them. They
can present detailed as well as aggregate information.
A data cube is a representation of a fact table presenting
measures organized by dimensions and hierarchies. While a
cube is limited to three dimensions, the underlying data table
may have more than three dimensions. In that case, we call it
a hyper-cube. Its full visual representation becomes impossible
with the limitations of two-dimensional sheets of paper and
computer monitors.
The data cube we will use in our example comes from the
Multidimensional Data Model of Figure 8, and its data is shown
in Table 3.
Figure 9 presents the data cube corresponding to the data
presented in Table 3.
Figure 9 - Data Cube
Pivot tables are the visual representation of projections of the hyper-cube on one or two dimensions, called the vertical dimension and the horizontal dimension. Pivot tables try to make the most of those two physical dimensions by allowing the user to combine more than one cube dimension in the table's vertical or horizontal dimensions.
There are a few basic operations that we can perform on pivot tables.
Pivoting
When we interchange the dimensions among its axes, we are pivoting the table. The data shown in the table does not change, but the user's view and perception are different. A different view may provide different insight into the data.
Slicing
Slicing is the act of selecting a specific value for one dimension. For example, Figure 10 shows the slice for the product Gizmo.
Figure 10 - Slice for Product Gizmo
Dicing
Dicing is a generalization of slicing, where we choose more than one specific value for one or more dimensions. For example, Figure 11 shows the sales for Q1 and Q2 for Montreal and Boston.
Figure 11 - Dice for Q1 and Q2, Montreal and Boston
Drill Down & Roll Up
Drilling down and rolling up are operations that use a dimension's hierarchy. We can roll up from the quarters to the years in the Period dimension, and from the cities to the regions in the City dimension. The table then contains less detailed data, as illustrated in Figure 12. We could also drill down from the cities to the neighbourhoods (if we had access to more detailed information). The table would then contain more detailed data.
Figure 12 - Roll Up and Drill Down
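The cube operations above can be sketched on a toy cube stored as a Python dictionary keyed by (product, city, quarter). The cube data and the helper names (`slice_`, `dice`, `roll_up`) are invented for the example:

```python
# A tiny cube: sales indexed by (product, city, quarter) — hypothetical data
cube = {("Gizmo", "Montreal", "Q1"): 10, ("Gizmo", "Boston", "Q1"): 20,
        ("Gadget", "Montreal", "Q2"): 15, ("Gizmo", "Montreal", "Q2"): 5}

def slice_(cube, product):
    """Slicing: fix a single value on one dimension."""
    return {(c, q): v for (p, c, q), v in cube.items() if p == product}

def dice(cube, cities, quarters):
    """Dicing: keep several values on one or more dimensions."""
    return {k: v for k, v in cube.items() if k[1] in cities and k[2] in quarters}

def roll_up(cube, quarter_to_year):
    """Rolling up: aggregate along a hierarchy (here, quarters to years)."""
    out = {}
    for (p, c, q), v in cube.items():
        key = (p, c, quarter_to_year[q])
        out[key] = out.get(key, 0) + v
    return out

print(slice_(cube, "Gizmo"))
print(roll_up(cube, {"Q1": "2011", "Q2": "2011"}))
```

Pivoting, the remaining operation, would only change which key components map to rows and columns of the display; the values are untouched.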
16.3 Infrastructure supporting BI
Performing Business Intelligence analyses can be successful only if you have the data available for those analyses. “You can't use data that you don't have,” might have said Yogi Berra1. The data used for BI must have a source and it must be stored somewhere.
This section will explain the technology infrastructure behind the use of BI, along with the behind-the-scenes work that must be performed regularly to maintain a viable BI initiative. While this work is usually done by an organization's IT department, BI analysts must be familiar with it because of its impact on data quality. Ideally, BI analysts should participate in the design of the infrastructure and its processes.
The major data warehouse vendors are Teradata, Oracle, IBM
InfoSphere, and SAP Business Information Warehouse.
16.3.1 Concept
An organization's information systems are the engine that makes it run smoothly. In the course of day-to-day operations, an organization records information that originates externally, like client orders, or is produced internally, like product manufacturing data, and that is then used by different departments to perform the required processes.
For example, a client’s order is recorded in the database by the
order-entry system. That order’s data, along with other orders,
is used by the manufacturing system to produce ordered
products and to record product availability in the database.
Then, the logistics system uses client orders and product avail-
ability to schedule and execute deliveries, and record delivery
data. Finally, the accounting system uses the delivery data to
produce and send invoices to clients, recording invoice data in
the database.
This is how information systems functioned, for decades, before
the advent of Business Intelligence.
The operational database is not well suited for BI applications. Because it is designed for efficiency and performance, data analysts find that it hinders their ability to analyze data properly. Even though it usually enforces business rules and integrity constraints, an operational database does not have the same data quality requirements needed by data analysts. Typos in names will not affect operations, but they may have an impact on some analyses.
Also, an operational database is not designed to record the
evolution of data. When a client changes his address, the new
address usually overwrites the old address. This may affect
data analysis that now associates all past sales to the new
neighbourhood.
When no longer needed, data from an operational database
can be archived and purged to reclaim disk space for current
operations. The data analyst then loses precious data.
Finally, the data analysts can perform analyses that can slow
the operational database down to a crawl by overwhelming
the database management system with resource consuming
queries. For most organizations, even a small slowdown is not
tolerable.
16.3.2 Data Sources
The data analyst may need data that is not in the operational database. For example, when analysing sales data for ice cream, it may be interesting to correlate it with each day's maximum temperature. Such data comes from external sources. (Consequently, the operational database is an internal source.)
External sources have to be chosen for their reliability and official status. You should not choose a random exchange rate on the Internet, but rather use your bank or your government's central bank. There are also practical considerations to take into account: if your bank publishes exchange rates with a 6-month lag, then you might look for a source with fresher data.
External data sources may be free or may need to be purchased. Usually, weather information can be obtained for free from most governments' web sites. But you have to purchase financial data from Bloomberg and market data from Nielsen.
16.3.3 Warehouse
Those reasons have led to the development of data warehouses.
A data warehouse is a type of database better suited for business
intelligence applications, using data structures designed for
data analysis rather than for operational systems.
Because data warehouses are huge, some organizations
create restricted views of the data for different users. These
views are called data marts, and they are usually accessible by
specific business areas, like marketing, manufacturing, human
resources, etc.
1 Yogi Berra was the manager of the New York Mets and the
New York Yankees baseball teams in the 1970’s and the 1980’s.
He is known for uttering truisms like “It
ain’t over till it’s over” and “You can observe a lot by
watching.” (But he never said anything about data.)
16.3.3.1 Star Schema
In a star schema, the measures are stored in a fact table, and each of its dimensions is linked to a dimension table, describing in detail each of its values.
The chapter’s Appendix illustrates a sales information fact
table (Table 4) with measures for the number of sales made
(NbSales), the total quantity sold (QtySold), the total amount
sold (AmountSold), aggregated for the dimensions ProductID,
PeriodID, and CityID.
The source of this fact table is the detailed sales transactions of
the operational database. Figure 13 illustrates the correspond-
ing star schema.
16.3.3.2 Snowflake Schema
The snowflake schema is an extension of the star schema that represents hierarchies. The fact table is built with the most specific dimension of a hierarchy. Then, each successive dimension of the hierarchy is linked to the previous one.
To build a snowflake schema from the star schema of Figure 13, we transform the City Dimension Table shown in Table 5 into a revised City Dimension Table (Table 8), a Country Dimension Table (Table 9), and a Region Dimension Table (Table 10). We also transform the Product Dimension Table of Table 7 into a revised Product Dimension Table (Table 11) and a Category Dimension Table (Table 12). The resulting snowflake schema is shown in Figure 14.
Both schemas are functionally identical. Choosing one or the other is a matter of trading redundancy for complexity. In a star schema, we store redundant information: the fact that the country USA is in the region North America is repeated for each city in the warehouse, which may be thousands of times. The snowflake schema saves storage space, but adds a layer of complexity whenever we need to see the Region field: the query must follow an extra link.
The business manager does not have to deal with this decision: reports and dashboards use queries that are preprogrammed by the IT specialists. The data analyst who writes his own SQL queries will need to know the structure chosen by the data warehouse designer.
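The extra links a snowflake query must follow can be demonstrated with Python's sqlite3 module. The table layout below is a minimal sketch of a Region–Country–City hierarchy with invented data, not the chapter's actual Tables 8 to 10:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Snowflake: each level of the hierarchy is its own table
CREATE TABLE region  (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE country (country_id INTEGER PRIMARY KEY, country_name TEXT,
                      region_id INTEGER REFERENCES region);
CREATE TABLE city    (city_id INTEGER PRIMARY KEY, city_name TEXT,
                      country_id INTEGER REFERENCES country);
CREATE TABLE sales_fact (city_id INTEGER REFERENCES city, amount REAL);

INSERT INTO region  VALUES (1, 'North America');
INSERT INTO country VALUES (1, 'USA', 1), (2, 'Canada', 1);
INSERT INTO city    VALUES (1, 'Boston', 1), (2, 'Montreal', 2);
INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (1, 25.0);
""")

# Reaching the Region field means following every link of the hierarchy
for row in con.execute("""SELECT region_name, SUM(amount)
                          FROM sales_fact
                          JOIN city    USING (city_id)
                          JOIN country USING (country_id)
                          JOIN region  USING (region_id)
                          GROUP BY region_name"""):
    print(row)
```

In the star variant, the city table would carry the country and region names directly, so the same total would need a single join, at the cost of repeating 'North America' on every city row.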
Figure 13 - Star Schema