Case study: Quasi real-time OLAP cubes by Ziemowit Jankowski Database Architect
OLAP Cubes - what is it?• Used to quickly analyze and retrieve data from different perspectives• Numeric data• Structured data: o can be represented as numeric values (or sets thereof) accessed by a composite key o each of the parts of the composite key belongs to a well-defined set of values• Facts = numeric values• Dimensions = parts of the composite key• Source = usually a start or snowflake schema in a relational DB (other sources possible)
OLAP Cubes - data sourcesStar schema Snowflake schema
OLAP Facts and dimensions• Every "cell" in an OLAP cube contains numeric data a.k.a "measures".• Every "cell" may contain more than one measure, e.g. forecast and outcome.• Every "cell" has a unique combination of dimension values.
OLAP Cubes - operations• Slice = choose values corresponding to ONE value on one or more dimensions• Dice = choose values corresponding to one slice or a number of consecutive slices on more than 2 dimensions of the cube
OLAP Cubes - operations (contd)• Drill down/up = choose lower/higher level details. Used in context of hierarchical dimensions.• Pivot = rotate the orientation of the data for reporting purposes• Roll-up
OLAP Cubes - refresh methods• Incremental: o possible when cubes grow "outwards", i.e. no "scattered" changes in data o only delta data need to be read o refresh may be fast if delta is small• Full: o possible for all cubes, even when changes are "scattered" all over thedata o all data need to be re-read with every o refresh may take long time (hours)
The situation on hand• Business operating on 24*6 basis (Sun-Fri)• Events from production systems are aggregated into flows and production units• Production figures may be adjusted manually long after production date• Daily production figures are basis for daily forecasts with the simplified formula: forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm• Adjustments in production figures will alter forecast figures• Outcome and forecast should be stored in MS OLAP cubes as per software architecture demands• The system should simplify comparisons between forecast and outcome figures
Software• Source of data: o Relational database o Oracle 10g database o extensive use of PL/SQL in database• Destination of data: o OLAP cubes - MS SQL Server Analysis Services (version 2005 and 2008)• Other software: o MS SQL Server database
QUESTIONCan we get almost real-time reports from MS OLAP cubes?ANSWERYES! The answer lies in "cube partitioning".
Cube partitioning - the basics• Cube partitions may be updated independently• Cube partitions may not overlap (duplicate values may occur)• Time is a good dimension to partition on
MS OLAP cube partitioning - details• Every cube partition has its own query to define the data set fetched from the data source• The SQL statements define the non-overlapping data sets
How to partition? - theory• Partitions with different lengths and different update frequencies: o current data = very small partition, very short update times, updated often o "not very current" data = a bit larger partition, longer update times, updated less often o historical data = large partition, long update times, updated seldom• Operation 24x6 delivers the "seldom" window
How to partition? - theory contd• One cube for both forecast and outcome
Solution - approach oneDecisions: • Cubes partitioned on date boundaries • MOLAP cubes (for better queryperformance) • Use SSIS to populate cubes o dimensions populated by incremental processing o facts populated by full processing o jobs for historical data must be run after midnight to compensate for date changeActions: • Cubes built • SSIS deployed inside SQL Server (and not filesystem) • SSIS set up as scheduled database jobs
Did it work?No!Malfunctions:• Simultaneous updates of cube partitions could lead to deadlocks• Deadlocks left cube partitions in unprocessed stateAmendment: • Cube partitions must not be updated simultaneously
Solution - approach twoDecisions: • Cube processing must be ONE partition at a time • Scheduling done by SSIS "super package": o SQL Server table contains approx. frequency and package names o "super package" executes SSIS packages as indicated by the tableActions: • Scheduling table created • "Super package" created to be self-modifying
Did it work?Not really!Malfunctions:• Historical data had to be updated after midnight and real- time updates for "Now" partition were postponed. This was done to avoid "gaps" in outcome data and "overlappings" in forecast data.• Real-time updates ended soon after midnight and were resumed a few hours later. (That was NOT acceptable.)Amendment: • Re-think!
Solution - approach threeDecisions: • Take advantage of 6*24 cycle (as opposed to 7*24) • Switch dates on Saturdays only o the "Now" partition had to stretch from Saturday to Saturday o all other partitions had to stretch from a Saturday to another Saturday • Re-process all time-consuming partitions on Saturday after switch of date
Solution - approach three contdActions: • Create logic in Oracle database to do date calculations "modulo week", i.e. based on Saturday. Logic implemented as function. • Rewrite SQL statements for cube partitions so that they employ the Oracle function (as above) instead of current date +/- given number of days. • Reschedule the time consuming updates so they run every 7th day.
Lessons learned• It is possible to build real-time OLAP cubes in MS technology• It is possible to make the partitions self-maintaining in terms of partition boundaries• The concept need careful engineering as there are pits in the way.
Omitted detailsSome details have been omitted: • the quasi real-time updates are scheduled to occur every 2nd or 3rd minute • scheduling is not exact, as the Super-job keeps track of what is to be run and when and executes SSIS packages based on "scheduled-to-run" state, their priority and a few other criteria • the source of data is not a proper star schema, it is rather an emulation of facts and dimensions by means of data tables and views in Oracle.