Case study- Real-time OLAP Cubes



  1. 1. Case study: Quasi real-time OLAP cubes, by Ziemowit Jankowski, Database Architect
  2. 2. OLAP Cubes - what are they?
     ● Used to quickly analyze and retrieve data from different perspectives
     ● Numeric data
     ● Structured data:
       o can be represented as numeric values (or sets thereof) accessed by a composite key
       o each part of the composite key belongs to a well-defined set of values
     ● Facts = numeric values
     ● Dimensions = parts of the composite key
     ● Source = usually a star or snowflake schema in a relational DB (other sources possible)
  3. 3. OLAP Cubes - data sources
     [Diagrams: a star schema and a snowflake schema, each centred on a "Production outcome" fact table with Product, Date and District dimensions; the further tables shown are Sub-district, Year, Month and Day of week]
  4. 4. OLAP Facts and dimensions
     ● Every "cell" in an OLAP cube contains numeric data, a.k.a. "measures".
     ● Every "cell" may contain more than one measure, e.g. forecast and outcome.
     ● Every "cell" has a unique combination of dimension values.
     [Diagram: cube with axes labelled District and Product]
  5. 5. OLAP Cubes - operations
     ● Slice = choose the values corresponding to ONE value on one or more dimensions
     ● Dice = choose the values corresponding to one slice or a number of consecutive slices on more than 2 dimensions of the cube
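
The two operations above can be illustrated on a toy cube held in a Python dictionary. This is only a sketch under assumptions: the cell values, the dimension names and the helpers `dice` and `slice_` are hypothetical and not part of the original system, and the sketch selects arbitrary sets of values rather than enforcing consecutive slices.

```python
# Hypothetical toy cube: each cell is keyed by (district, product, date).
DIMENSIONS = ("district", "product", "date")
cube = {
    ("North", "Bread", "2010-01-01"): 10.0,
    ("North", "Milk", "2010-01-01"): 4.0,
    ("South", "Bread", "2010-01-01"): 7.0,
    ("South", "Bread", "2010-01-02"): 8.0,
    ("South", "Milk", "2010-01-02"): 3.0,
}


def dice(cells, **allowed):
    """Keep the cells whose value on each named dimension is in the allowed set."""
    index = {name: DIMENSIONS.index(name) for name in allowed}
    return {
        key: value
        for key, value in cells.items()
        if all(key[index[name]] in allowed[name] for name in allowed)
    }


def slice_(cells, **fixed):
    """Slice = dice with exactly ONE value on each named dimension."""
    return dice(cells, **{name: {value} for name, value in fixed.items()})


north = slice_(cube, district="North")
south_bread = dice(
    cube,
    district={"South"},
    product={"Bread"},
    date={"2010-01-01", "2010-01-02"},
)
```

Here `north` holds the two cells of the North district, and `south_bread` holds the South/Bread cells for both dates.
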
  6. 6. OLAP Cubes - operations (cont'd)
     ● Drill down/up = choose a lower/higher level of detail. Used in the context of hierarchical dimensions.
     ● Pivot = rotate the orientation of the data for reporting purposes
     ● Roll-up
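
Drilling up and down a hierarchical dimension can be sketched as aggregating day-level cells to month level and then looking up the days behind one month. The data and the helper names `roll_up_to_month` and `drill_down` are assumptions made for illustration, and summation is assumed as the aggregation.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day-level facts keyed by (district, day).
daily = {
    ("North", date(2010, 1, 30)): 5.0,
    ("North", date(2010, 1, 31)): 6.0,
    ("North", date(2010, 2, 1)): 7.0,
    ("South", date(2010, 2, 1)): 2.0,
}


def roll_up_to_month(cells):
    """Aggregate day-level cells to (district, year, month) by summing."""
    totals = defaultdict(float)
    for (district, day), value in cells.items():
        totals[(district, day.year, day.month)] += value
    return dict(totals)


def drill_down(cells, district, year, month):
    """Return the day-level cells behind one month-level cell."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] == district and key[1].year == year and key[1].month == month
    }


monthly = roll_up_to_month(daily)
january_north = drill_down(daily, "North", 2010, 1)
```
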
  7. 7. OLAP Cubes - refresh methods
     ● Incremental:
       o possible when cubes grow "outwards", i.e. no "scattered" changes in data
       o only delta data need to be read
       o refresh may be fast if the delta is small
     ● Full:
       o possible for all cubes, even when changes are "scattered" all over the data
       o all data need to be re-read with every refresh
       o refresh may take a long time (hours)
     [Diagrams: cube data plotted against time, one marked "Updates" and one marked "New data"]
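
The difference between the two refresh methods can be sketched on a simple summed aggregate. The functions `full_refresh` and `incremental_refresh` and the sample rows are hypothetical; they model the idea only and say nothing about how Analysis Services processes a cube internally.

```python
def full_refresh(source_rows):
    """Rebuild the aggregate by re-reading every source row."""
    cube = {}
    for key, value in source_rows:
        cube[key] = cube.get(key, 0.0) + value
    return cube


def incremental_refresh(cube, delta_rows):
    """Add only the new rows; correct only if already-loaded rows never change."""
    for key, value in delta_rows:
        cube[key] = cube.get(key, 0.0) + value
    return cube


# Hypothetical rows: (key, value) with key = (district, day).
old_rows = [(("North", "2010-01-01"), 5.0)]
new_rows = [(("North", "2010-01-02"), 6.0)]

grown = incremental_refresh(full_refresh(old_rows), new_rows)

# A "scattered" change: an old figure is adjusted from 5.0 to 9.0.
# The incremental path never re-reads that row, so only a full refresh sees it.
adjusted_rows = [(("North", "2010-01-01"), 9.0)] + new_rows
rebuilt = full_refresh(adjusted_rows)
```

When data only grows "outwards", `grown` equals a full rebuild from `old_rows + new_rows`. After the adjustment, `grown` is stale and `rebuilt` is the correct state.
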
  8. 8. The situation at hand
     ● Business operating on a 24*6 basis (Sun-Fri)
     ● Events from production systems are aggregated into flows and production units
     ● Production figures may be adjusted manually long after the production date
     ● Daily production figures are the basis for daily forecasts, with the simplified formula:
       forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
     ● Adjustments in production figures will alter forecast figures
     ● Outcome and forecast should be stored in MS OLAP cubes, as the software architecture demands
     ● The system should simplify comparisons between forecast and outcome figures
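
The simplified forecast formula above translates directly into a small function. The parameter names and the sample figures are assumptions chosen for illustration; the real system computed this in the Oracle database.

```python
def forecast(production_prev_year, trend, manual_adjustment=0.0):
    """forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm"""
    return production_prev_year * trend + manual_adjustment


# Hypothetical figures for one day and one production unit.
original = forecast(1000.0, 1.25, 20.0)

# A late manual correction of last year's production figure (1000 -> 1040)
# changes the forecast as well, which is why old data cannot be treated as frozen.
after_adjustment = forecast(1040.0, 1.25, 20.0)
```

The second call shows why "scattered" changes reach the forecast measure too: one corrected production figure changes the forecast for the same day a year later.
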
  9. 9. Software
     ● Source of data:
       o Relational database
       o Oracle 10g database
       o extensive use of PL/SQL in the database
     ● Destination of data:
       o OLAP cubes - MS SQL Server Analysis Services (versions 2005 and 2008)
     ● Other software:
       o MS SQL Server database
  10. 10. QUESTION: Can we get almost real-time reports from MS OLAP cubes?
      ANSWER: YES! The answer lies in "cube partitioning".
  11. 11. Cube partitioning - the basics
      ● Cube partitions may be updated independently
      ● Cube partitions must not overlap (otherwise duplicate values may occur)
      ● Time is a good dimension to partition on
      [Diagram: cube partitioned along the Time axis]
  12. 12. MS OLAP cube partitioning - details
      ● Every cube partition has its own query to define the data set fetched from the data source
      ● The SQL statements define the non-overlapping data sets
      [Diagram: data source (relational DB) feeding a partitioned cube through SQL queries a, b, c and d]
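
One way to keep the partition queries non-overlapping is to derive them all from a single sorted list of date boundaries and to use half-open ranges, so that each row matches exactly one query. The sketch below is an assumption about how this could be done: the table name `production_fact`, the column name `prod_date` and the function `partition_queries` are hypothetical.

```python
from datetime import date


def partition_queries(boundaries, table="production_fact", column="prod_date"):
    """Build one SELECT per partition from strictly increasing date boundaries.

    Each partition covers the half-open range [start, end), so a row on a
    boundary date belongs to exactly one partition.
    """
    if list(boundaries) != sorted(set(boundaries)):
        raise ValueError("boundaries must be strictly increasing")
    queries = []
    for start, end in zip(boundaries, boundaries[1:]):
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {column} >= DATE '{start.isoformat()}' "
            f"AND {column} < DATE '{end.isoformat()}'"
        )
    return queries


queries = partition_queries([date(2009, 1, 1), date(2010, 1, 1), date(2010, 2, 1)])
```

Three boundaries give two queries; the boundary date 2010-01-01 is excluded from the first and included in the second.
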
  13. 13. MS OLAP cube partitioning - details
      [Diagram: data source (relational DB) feeding a partitioned cube through SQL queries a, b, c and d; labels: Dim, Dim, Dim, Facts, "Small amount of data", "Large amount of data"]
  14. 14. How to partition? - theory
      ● Partitions with different lengths and different update frequencies:
        o current data = very small partition, very short update times, updated often
        o "not very current" data = a somewhat larger partition, longer update times, updated less often
        o historical data = large partition, long update times, updated seldom
      ● The 24*6 operation provides the window for the "seldom" updates
  15. 15. How to partition? - theory cont'd
      ● One cube for both forecast and outcome
      [Diagram: timeline with the labels History, Last year, Last month, Now and "One year into the future", with the Forecast measure and the Outcome measure shown along it]
  16. 16. Solution - approach one
      Decisions:
      ● Cubes partitioned on date boundaries
      ● MOLAP cubes (for better query performance)
      ● Use SSIS to populate cubes
        o dimensions populated by incremental processing
        o facts populated by full processing
        o jobs for historical data must be run after midnight to compensate for the date change
      Actions:
      ● Cubes built
      ● SSIS packages deployed inside SQL Server (and not in the file system)
      ● SSIS packages set up as scheduled database jobs
  17. 17. Did it work? No!
      Malfunctions:
      ● Simultaneous updates of cube partitions could lead to deadlocks
      ● Deadlocks left cube partitions in an unprocessed state
      Amendment:
      ● Cube partitions must not be updated simultaneously
  18. 18. Solution - approach two
      Decisions:
      ● Cube processing must be done ONE partition at a time
      ● Scheduling done by an SSIS "super package":
        o a SQL Server table contains approximate frequencies and package names
        o the "super package" executes SSIS packages as indicated by the table
      Actions:
      ● Scheduling table created
      ● "Super package" created to be self-modifying
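
The table-driven, one-at-a-time scheduling described above can be sketched as follows. This is a simplified model, not the actual SSIS "super package": the class `ScheduleRow`, the function `run_due_packages`, the package names and the minute-based clock are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScheduleRow:
    package: str        # hypothetical SSIS package name
    every_minutes: int  # approximate frequency
    next_due: int = 0   # minute at which the package is next due


def run_due_packages(rows, now, execute):
    """Run every package that is due, strictly one after another."""
    ran = []
    for row in sorted(rows, key=lambda r: r.next_due):
        if row.next_due <= now:
            # execute() is expected to block until the package has finished,
            # so two cube partitions are never processed at the same time.
            execute(row.package)
            row.next_due = now + row.every_minutes
            ran.append(row.package)
    return ran


table = [
    ScheduleRow("process_now_partition", every_minutes=2),
    ScheduleRow("process_history_partition", every_minutes=7 * 24 * 60),
]
log = []
first_pass = run_due_packages(table, now=0, execute=log.append)
second_pass = run_due_packages(table, now=2, execute=log.append)
```

In this sketch the first pass runs both packages in sequence, while the second pass, two minutes later, runs only the frequently updated one.
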
  19. 19. Did it work? Not really!
      Malfunctions:
      ● Historical data had to be updated after midnight, and real-time updates for the "Now" partition were postponed meanwhile. This was done to avoid "gaps" in outcome data and "overlaps" in forecast data.
      ● Real-time updates ended soon after midnight and were resumed a few hours later. (That was NOT acceptable.)
      Amendment:
      ● Re-think!
  20. 20. Solution - approach three
      Decisions:
      ● Take advantage of the 24*6 cycle (as opposed to 24*7)
      ● Switch dates on Saturdays only
        o the "Now" partition had to stretch from Saturday to Saturday
        o all other partitions had to stretch from one Saturday to another Saturday
      ● Re-process all time-consuming partitions on Saturday after the switch of date
  21. 21. Solution - approach three cont'd
      Actions:
      ● Create logic in the Oracle database to do date calculations "modulo week", i.e. based on Saturday. Logic implemented as a function.
      ● Rewrite the SQL statements for cube partitions so that they employ the Oracle function (as above) instead of current date +/- a given number of days.
      ● Reschedule the time-consuming updates so that they run every 7th day.
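
The "modulo week" date calculation can be sketched in Python. The original logic was an Oracle function whose code is not shown in the slides, so the names `last_saturday` and `now_partition_bounds` and the half-open bounds are assumptions illustrating the idea only.

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday() counts Monday as 0, so Saturday is 5


def last_saturday(today):
    """Most recent Saturday on or before `today`."""
    return today - timedelta(days=(today.weekday() - SATURDAY) % 7)


def now_partition_bounds(today):
    """Half-open [start, end) bounds of the "Now" partition: Saturday to Saturday."""
    start = last_saturday(today)
    return start, start + timedelta(days=7)


# 2010-01-06 was a Wednesday; the bounds stay the same all week
# and only move when the next Saturday arrives.
bounds_wednesday = now_partition_bounds(date(2010, 1, 6))
bounds_friday = now_partition_bounds(date(2010, 1, 8))
bounds_next_saturday = now_partition_bounds(date(2010, 1, 9))
```

Because the bounds depend on the last Saturday rather than on the current date, the partition boundaries stay fixed from Sunday to Friday, and the expensive re-processing falls on the day without production.
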
  22. 22. Did it work? Yes!
      Malfunctions:
      ● None, really.
  23. 23. Lessons learned
      ● It is possible to build real-time OLAP cubes in MS technology
      ● It is possible to make the partitions self-maintaining in terms of partition boundaries
      ● The concept needs careful engineering, as there are pitfalls along the way.
  24. 24. Omitted details
      Some details have been omitted:
      ● the quasi real-time updates are scheduled to occur every 2nd or 3rd minute
      ● scheduling is not exact, as the super-job keeps track of what is to be run and when, and executes SSIS packages based on their "scheduled-to-run" state, their priority and a few other criteria
      ● the source of data is not a proper star schema; it is rather an emulation of facts and dimensions by means of data tables and views in Oracle.