This case study describes an approach to building quasi real-time OLAP cubes in Microsoft SQL Server Analysis Services to enable daily comparisons of production forecasts and outcomes. The cubes are partitioned by time to allow independent and frequent updates. Initial attempts failed due to deadlocks caused by simultaneous partition updates. The working solution takes advantage of a 6-day work week: partition dates are switched on Saturdays only, and partitions are reprocessed then. This allows real-time and historical partition updates without gaps or overlaps in the data.
2. OLAP Cubes - what are they?
●Used to quickly analyze and retrieve data from different perspectives
●Numeric data
●Structured data:
o can be represented as numeric values (or sets thereof) accessed by a composite key
o each part of the composite key belongs to a well-defined set of values
●Facts = numeric values
●Dimensions = parts of the composite key
●Source = usually a star or snowflake schema in a relational DB (other sources possible)
3. OLAP Cubes - data sources
[Diagram: star schema - fact table "Production outcome" linked directly to the dimensions Product, Date, District and Sub-district; snowflake schema - the same fact table with District normalized into Sub-district, and Date normalized into Year, Month and Day of week.]
4. OLAP Facts and dimensions
●Every "cell" in an OLAP cube contains numeric data, a.k.a. "measures".
●Every "cell" may contain more than one measure, e.g. forecast and outcome.
●Every "cell" has a unique combination of dimension values.
[Diagram: a cube with District and Product among its axes.]
5. OLAP Cubes - operations
●Slice = choose the values corresponding to ONE value on one or more dimensions
●Dice = choose the values corresponding to one slice, or to a number of consecutive slices, on two or more dimensions of the cube
6. OLAP Cubes - operations (cont'd)
●Drill down/up = choose a lower/higher level of detail. Used in the context of hierarchical dimensions.
●Pivot = rotate the orientation of the data for reporting purposes
●Roll-up = summarize the data along a dimension hierarchy (the opposite of drill-down)
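These operations can be illustrated on a toy cube held as a Python dict keyed by a composite key. All names and figures below are invented for illustration; they are not from the real system.

```python
# Toy cube: facts keyed by the composite key (district, product, year).
cube = {
    ("North", "A", 2008): 10,
    ("North", "B", 2008): 20,
    ("South", "A", 2008): 30,
    ("South", "A", 2009): 40,
}

def slice_cube(cube, dim_index, value):
    """Slice: fix ONE value on one dimension."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice(cube, criteria):
    """Dice: restrict several dimensions to sets of allowed values.
    criteria maps dimension index -> set of accepted values."""
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in criteria.items())}

def roll_up(cube, dim_index):
    """Roll-up: aggregate one dimension away (sum over it)."""
    out = {}
    for k, v in cube.items():
        reduced = k[:dim_index] + k[dim_index + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

print(slice_cube(cube, 0, "North"))           # only the "North" slice
print(dice(cube, {0: {"South"}, 2: {2008}}))  # sub-cube: South in 2008
print(roll_up(cube, 1))                       # summed over products
```

A real OLAP engine evaluates these operations against indexed storage, of course; the sketch only shows what each operation selects.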
7. OLAP Cubes - refresh methods
●Incremental:
o possible when cubes grow "outwards", i.e. no "scattered" changes in the data
o only the delta data needs to be read
o refresh may be fast if the delta is small
●Full:
o possible for all cubes, even when changes are "scattered" all over the data
o all data needs to be re-read with every refresh
o a refresh may take a long time (hours)
[Diagrams: cube data over time - scattered updates require a full refresh; new data appended at the end allows an incremental refresh.]
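The difference between the two refresh styles can be sketched as follows. The fact rows and names are invented for illustration; in the real system the data comes from Oracle source tables.

```python
from datetime import date

# Illustrative fact rows: (production_date, value).
facts = [
    (date(2009, 1, 1), 100),
    (date(2009, 1, 2), 110),
    (date(2009, 1, 3), 120),
]

def full_refresh(facts):
    """Full processing: re-read every row, whatever changed."""
    return list(facts)

def incremental_refresh(facts, last_loaded):
    """Incremental processing: read only the delta, i.e. rows newer than
    the previous load. Only valid when the cube grows "outwards" -- no
    scattered changes to already-loaded dates."""
    return [row for row in facts if row[0] > last_loaded]

print(incremental_refresh(facts, date(2009, 1, 2)))  # only the newest row
```

With manual adjustments of old production figures (as in this case study), the "scattered changes" condition fails, which is why the historical partitions need full processing.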
8. The situation at hand
●Business operating on 24*6 basis (Sun-Fri)
●Events from production systems are aggregated into flows
and production units
●Production figures may be adjusted manually long after
production date
●Daily production figures are basis for daily forecasts with
the simplified formula:
forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
●Adjustments in production figures will alter forecast figures
●Outcome and forecast should be stored in MS OLAP cubes as
per software architecture demands
●The system should simplify comparisons between forecast
and outcome figures
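In code, the simplified forecast formula above looks roughly like this. The function name and sample numbers are illustrative; the real computation lives in PL/SQL.

```python
def forecast(production_prev_year, trend, manual_adjustment=0.0):
    """Simplified daily forecast:
    forecast(yearX) = production(yearX-1) * trend(yearX) + manual adjustment.
    Note that a later correction of production_prev_year changes the
    forecast too, which is why forecast data must also be reprocessed."""
    return production_prev_year * trend + manual_adjustment

# 1000 units last year, 5% expected growth, +20 manual correction:
print(forecast(1000, 1.05, 20))  # -> 1070.0
```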
9. Software
●Source of data:
o relational database (Oracle 10g)
o extensive use of PL/SQL in the database
●Destination of data:
o OLAP cubes - MS SQL Server Analysis Services (versions 2005 and 2008)
●Other software:
o MS SQL Server database
10. QUESTION
Can we get almost real-time reports from MS OLAP cubes?
ANSWER
YES! The answer lies in "cube partitioning".
11. Cube partitioning - the basics
●Cube partitions may be updated independently
●Cube partitions must not overlap (otherwise duplicate values may occur)
●Time is a good dimension to partition on
[Diagram: a cube partitioned along the time dimension.]
12. MS OLAP cube partitioning - details
●Every cube partition has its own query to define the data
set fetched from the data source
●The SQL statements define the non-overlapping data sets
[Diagram: four SQL queries (a-d) against the relational data source, each feeding one partition of the cube.]
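Non-overlapping partition queries can be generated from a list of date boundaries, for example like this. The table and column names are hypothetical; the point is the half-open intervals.

```python
from datetime import date

# Half-open [start, end) boundaries guarantee no gaps and no overlaps.
boundaries = [date(2007, 1, 1), date(2008, 1, 1),
              date(2009, 1, 1), date(2009, 6, 1)]

def partition_queries(boundaries):
    """One SELECT per partition. Adjacent partitions share a boundary
    date, so every fact row falls into exactly one partition."""
    queries = []
    for start, end in zip(boundaries, boundaries[1:]):
        queries.append(
            "SELECT * FROM production_facts "
            f"WHERE prod_date >= DATE '{start}' "
            f"AND prod_date < DATE '{end}'"
        )
    return queries

for q in partition_queries(boundaries):
    print(q)
```

Using `>=` on the start and `<` on the end is what keeps the data sets disjoint; two `BETWEEN` clauses sharing a boundary date would double-count that day.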
13. MS OLAP cube partitioning - details
[Diagram: the same partitioned cube; the dimension tables hold a small amount of data, while the fact table holds a large amount of data - hence the per-partition fact queries.]
14. How to partition? - theory
●Partitions with different lengths and different update frequencies:
o current data = very small partition, very short update times, updated often
o "not very current" data = a bit larger partition, longer update times, updated less often
o historical data = large partition, long update times, updated seldom
●Operation 24x6 delivers the "seldom" window
15. How to partition? - theory cont'd
●One cube for both forecast and outcome
[Diagram: a timeline from "History" through "Last year", "Last month" and "Now" to one year into the future; the outcome measure covers the past, the forecast measure extends into the future.]
16. Solution - approach one
Decisions:
●Cubes partitioned on date boundaries
●MOLAP cubes (for better query performance)
●Use SSIS to populate the cubes
o dimensions populated by incremental processing
o facts populated by full processing
o jobs for historical data must be run after midnight to compensate for the date change
Actions:
●Cubes built
●SSIS packages deployed inside SQL Server (not on the filesystem)
●SSIS packages set up as scheduled database jobs
17. Did it work?
No!
Malfunctions:
●Simultaneous updates of cube partitions could lead to
deadlocks
●Deadlocks left cube partitions in unprocessed state
Amendment:
●Cube partitions must not be updated simultaneously
18. Solution - approach two
Decisions:
●Cube processing must be ONE partition at a time
●Scheduling done by an SSIS "super package":
o a SQL Server table contains approximate run frequencies and package names
o the "super package" executes SSIS packages as indicated by the table
Actions:
●Scheduling table created
●"Super package" created to be self-modifying
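The "super package" idea can be sketched as a table of package names with approximate run frequencies, processed strictly one at a time. All names and intervals below are invented; the real implementation is an SSIS package driven by a SQL Server table.

```python
from datetime import datetime, timedelta

# Scheduling table: package name -> approximate refresh interval.
schedule = {
    "process_now_partition":     timedelta(minutes=3),
    "process_last_month":        timedelta(hours=1),
    "process_history_partition": timedelta(days=7),
}
last_run = {name: datetime.min for name in schedule}

def due_packages(now):
    """Packages whose interval has elapsed, most frequent first."""
    due = [name for name, interval in schedule.items()
           if now - last_run[name] >= interval]
    return sorted(due, key=lambda name: schedule[name])

def run_one(now, execute):
    """Run at most ONE package per tick -- partitions must never be
    processed simultaneously, or processing can deadlock."""
    due = due_packages(now)
    if due:
        name = due[0]
        execute(name)
        last_run[name] = now

run_one(datetime(2009, 6, 1, 12, 0), print)  # runs the most urgent package
```

The single-runner loop is the essential point: serializing the partition updates is what removed the deadlocks of approach one.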
19. Did it work?
Not really!
Malfunctions:
●Historical data had to be updated after midnight, and real-time updates of the "Now" partition were postponed meanwhile. This was done to avoid "gaps" in the outcome data and "overlaps" in the forecast data.
●Real-time updates therefore stopped soon after midnight and were resumed only a few hours later. (That was NOT acceptable.)
Amendment:
●Re-think!
20. Solution - approach three
Decisions:
●Take advantage of the 6*24 cycle (as opposed to 7*24)
●Switch dates on Saturdays only:
o the "Now" partition then stretches from Saturday to Saturday
o all other partitions likewise stretch from one Saturday to another
●Re-process all time-consuming partitions on Saturday, after the switch of date
21. Solution - approach three cont'd
Actions:
●Create logic in the Oracle database to do date calculations "modulo week", i.e. based on Saturdays. The logic is implemented as a function.
●Rewrite the SQL statements for the cube partitions so that they use this Oracle function instead of the current date +/- a given number of days.
●Reschedule the time-consuming updates so that they run every 7th day.
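The "modulo week" calculation can be sketched in Python as follows (the real system implements it as an Oracle PL/SQL function). The idea: every partition boundary is the most recent Saturday, so boundaries move only once a week.

```python
from datetime import date, timedelta

SATURDAY = 5  # Monday == 0 in Python's date.weekday()

def last_saturday(d):
    """Most recent Saturday on or before d. Using this as the partition
    boundary means the boundary only moves once a week, so partitions
    stay gap-free and overlap-free between Saturdays."""
    return d - timedelta(days=(d.weekday() - SATURDAY) % 7)

# The "Now" partition then stretches from Saturday to Saturday:
start = last_saturday(date(2009, 6, 3))      # a Wednesday
end = start + timedelta(days=7)
print(start, end)
```

In Oracle, a comparable effect could be had with TRUNC(date, 'IW')-style week arithmetic, but the slides do not state how the function was written, so this is only an illustration of the boundary rule.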
23. Lessons learned
●It is possible to build quasi real-time OLAP cubes with MS technology
●It is possible to make the partitions self-maintaining in terms of partition boundaries
●The concept needs careful engineering, as there are pitfalls along the way.
24. Omitted details
Some details have been omitted:
●the quasi real-time updates are scheduled to occur every 2nd or 3rd minute
●the scheduling is not exact: the "super package" keeps track of what is to be run and when, and executes SSIS packages based on their "scheduled-to-run" state, their priority and a few other criteria
●the source of the data is not a proper star schema; rather, it is an emulation of facts and dimensions by means of tables and views in Oracle.