Case study:
Quasi real-time OLAP cubes
by Ziemowit Jankowski
Database Architect
OLAP Cubes - what are they?
●Used to quickly analyze and retrieve data from different perspectives
●Numeric data
●Structured data:
  o can be represented as numeric values (or sets thereof) accessed by a composite key
  o each part of the composite key belongs to a well-defined set of values
●Facts = the numeric values
●Dimensions = the parts of the composite key
●Source = usually a star or snowflake schema in a relational DB (other sources possible)
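The definition above, numeric facts accessed by a composite key whose parts come from well-defined sets, can be sketched in a few lines of Python (all names and figures are illustrative, not from the case study):

```python
# Minimal illustrative model: an OLAP cube as a mapping from a
# composite key (one value per dimension) to a dict of measures.
cube = {
    # (product, date, district) -> facts
    ("widget", "2010-01-04", "north"): {"outcome": 120.0},
    ("widget", "2010-01-04", "south"): {"outcome": 95.0},
    ("gadget", "2010-01-05", "north"): {"outcome": 40.0},
}

# Each part of the composite key belongs to a well-defined set of values.
dimensions = {
    "product": {"widget", "gadget"},
    "date": {"2010-01-04", "2010-01-05"},
    "district": {"north", "south"},
}

def total_outcome(cube):
    """Aggregate the 'outcome' measure over all cells."""
    return sum(facts["outcome"] for facts in cube.values())

print(total_outcome(cube))  # 255.0
```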
OLAP Cubes - data sources
[Diagrams: star and snowflake schemas, each centered on a "Production outcome" fact table with Product, Date, and District dimensions; the snowflake variant normalizes dimensions into further tables such as Sub-district, Year, Month, and Day of week.]
OLAP Facts and dimensions
●Every "cell" in an OLAP cube contains numeric data, a.k.a. "measures".
●Every "cell" may contain more than one measure, e.g. forecast and outcome.
●Every "cell" has a unique combination of dimension values.
[Diagram: a cube with District and Product as two of its axes.]
OLAP Cubes - operations
●Slice = choose values corresponding to
ONE value on one or more dimensions
●Dice = choose values corresponding to
one slice or a number of consecutive
slices on more than 2 dimensions of
the cube
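Using the same kind of toy cube keyed by (product, date, district), the two operations can be sketched as filters over the composite key (a hypothetical illustration, not the MS OLAP API):

```python
# Hypothetical cube: (product, date, district) -> outcome measure.
cube = {
    ("widget", "2010-01-04", "north"): 120.0,
    ("widget", "2010-01-05", "north"): 110.0,
    ("gadget", "2010-01-04", "south"): 95.0,
}

def slice_cube(cube, dim_index, value):
    """Slice: keep cells matching ONE fixed value on one dimension."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_cube(cube, dim_index, values):
    """Dice: keep cells whose value on a dimension falls in a chosen
    set of consecutive values (a range of slices)."""
    return {k: v for k, v in cube.items() if k[dim_index] in values}

print(len(slice_cube(cube, 0, "widget")))       # 2 cells
print(len(dice_cube(cube, 1, {"2010-01-04"})))  # 2 cells
```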
OLAP Cubes - operations (cont'd)
●Drill down/up = choose lower/higher
level details. Used in context of
hierarchical dimensions.
●Pivot = rotate the orientation of the
data for reporting purposes
●Roll-up = summarize data along a dimension hierarchy, e.g. daily values rolled up into monthly totals (the counterpart of drill-down)
OLAP Cubes - refresh methods
●Incremental:
  o possible when cubes grow "outwards", i.e. no "scattered" changes in data
  o only delta data need to be read
  o refresh may be fast if the delta is small
●Full:
  o possible for all cubes, even when changes are "scattered" all over the data
  o all data need to be re-read with every refresh
  o refresh may take a long time (hours)
[Diagrams: incremental refresh appends new data at the end of the cube's time axis; full refresh re-reads cube data with updates scattered throughout.]
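The two refresh styles can be sketched over a hypothetical time-keyed store (all names and data are illustrative):

```python
# source: time key -> row in the relational DB; cube: already-loaded data.
source = {1: "a", 2: "b", 3: "c", 4: "d"}
cube = {1: "a", 2: "b"}

def incremental_refresh(cube, source, last_loaded):
    """Read only the delta: rows with a time key beyond the watermark.
    Correct only if old rows never change ("outward" growth)."""
    delta = {t: row for t, row in source.items() if t > last_loaded}
    cube.update(delta)
    return max(source)  # new watermark

def full_refresh(source):
    """Re-read everything; correct even for scattered changes, but slow."""
    return dict(source)

watermark = incremental_refresh(cube, source, last_loaded=2)
print(watermark, cube == source)  # 4 True
```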
The situation at hand
●Business operating on 24*6 basis (Sun-Fri)
●Events from production systems are aggregated into flows
and production units
●Production figures may be adjusted manually long after
production date
●Daily production figures are the basis for daily forecasts, with the simplified formula:
forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
●Adjustments in production figures will alter forecast figures
●Outcome and forecast should be stored in MS OLAP cubes as
per software architecture demands
●The system should simplify comparisons between forecast
and outcome figures
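The simplified formula from the slide can be illustrated with made-up figures (the production, trend, and adjustment values below are hypothetical):

```python
def forecast(production_prev_year, trend, manual_adjustment):
    """forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm"""
    return production_prev_year * trend + manual_adjustment

# If last year's production was 1000 units, the trend factor is 1.05
# (+5%), and an analyst added a manual adjustment of 20 units:
print(forecast(1000, 1.05, 20))  # 1070.0
```

Note that because the forecast is derived from last year's production, any manual adjustment of a past production figure silently changes the corresponding forecast, which is exactly why scattered historical updates matter for cube refreshes.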
Software
●Source of data:
  o relational database
  o Oracle 10g database
  o extensive use of PL/SQL in the database
●Destination of data:
  o OLAP cubes - MS SQL Server Analysis Services (versions 2005 and 2008)
●Other software:
  o MS SQL Server database
QUESTION
Can we get almost real-time reports from MS OLAP cubes?
ANSWER
YES! The answer lies in "cube partitioning".
Cube partitioning - the basics
●Cube partitions may be updated independently
●Cube partitions must not overlap (otherwise duplicate values may occur)
●Time is a good dimension to partition on
[Diagram: a cube partitioned along the Time dimension.]
MS OLAP cube partitioning - details
●Every cube partition has its own query to define the data
set fetched from the data source
●The SQL statements define the non-overlapping data sets
[Diagram: four SQL queries (a-d), each feeding one partition of the cube from the relational data source.]
MS OLAP cube partitioning - details
[Diagram: the same partitioned cube; the dimension tables hold a small amount of data, while the fact table holds a large amount.]
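Since each partition's SQL statement must select a disjoint data set, one way to guarantee non-overlap is to derive every WHERE clause from a single ordered list of boundary dates; a hypothetical sketch (table, column, and boundary values are invented, not from the case study):

```python
# Boundary dates are illustrative. Each lower bound is inclusive and the
# upper bound exclusive, so consecutive partitions cannot overlap.
boundaries = ["2008-01-01", "2010-01-01", "2010-06-01", "2010-06-26"]

def partition_queries(boundaries):
    """One SELECT per partition, covering [lo, hi) date ranges."""
    queries = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        queries.append(
            "SELECT * FROM fact_production "
            f"WHERE prod_date >= DATE '{lo}' AND prod_date < DATE '{hi}'"
        )
    return queries

for q in partition_queries(boundaries):
    print(q)
```

Because adjacent queries share a boundary (one as exclusive upper bound, the next as inclusive lower bound), every fact row lands in exactly one partition.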
How to partition? - theory
●Partitions with different lengths and different update frequencies:
  o current data = very small partition, very short update times, updated often
  o "not very current" data = a bit larger partition, longer update times, updated less often
  o historical data = large partition, long update times, updated seldom
●24x6 operation delivers the "seldom" window
How to partition? - theory cont'd
●One cube for both forecast and outcome
[Diagram: a single cube holding both the forecast and outcome measures, with time partitions for History, Last year, Last month, Now, and One year into the future.]
Solution - approach one
Decisions:
●Cubes partitioned on date boundaries
●MOLAP cubes (for better query performance)
●Use SSIS to populate cubes
  o dimensions populated by incremental processing
  o facts populated by full processing
  o jobs for historical data must be run after midnight to compensate for the date change
Actions:
●Cubes built
●SSIS packages deployed inside SQL Server (not on the filesystem)
●SSIS packages set up as scheduled database jobs
Did it work?
No!
Malfunctions:
●Simultaneous updates of cube partitions could lead to
deadlocks
●Deadlocks left cube partitions in an unprocessed state
Amendment:
●Cube partitions must not be updated simultaneously
Solution - approach two
Decisions:
●Cube processing must be ONE partition at a time
●Scheduling done by an SSIS "super package":
  o a SQL Server table contains approximate frequencies and package names
  o the "super package" executes SSIS packages as indicated by the table
Actions:
●Scheduling table created
●"Super package" created to be self-modifying
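The table-driven "super package" idea, running one partition update at a time, might look like this sketch (the table columns and package names are assumptions for illustration, not the actual implementation):

```python
import datetime as dt

# Scheduling table: package name, approximate frequency, last run time.
schedule = [
    {"package": "process_now_partition",     "every_min": 3,        "last_run": None},
    {"package": "process_month_partition",   "every_min": 60,       "last_run": None},
    {"package": "process_history_partition", "every_min": 7*24*60,  "last_run": None},
]

def next_package(schedule, now):
    """Pick ONE due package (never two at once, to avoid deadlocks).
    Among due packages, prefer the most frequently scheduled one."""
    due = [
        job for job in schedule
        if job["last_run"] is None
        or (now - job["last_run"]).total_seconds() / 60 >= job["every_min"]
    ]
    return min(due, key=lambda job: job["every_min"]) if due else None

now = dt.datetime(2010, 6, 26, 1, 0)
job = next_package(schedule, now)
print(job["package"])  # process_now_partition
```

The key design point is serialization: whatever the priority rule, the super package launches exactly one SSIS package and waits for it to finish before consulting the table again.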
Did it work?
Not really!
Malfunctions:
●Historical data had to be updated after midnight, and real-time updates for the "Now" partition were postponed meanwhile. This was done to avoid "gaps" in outcome data and "overlaps" in forecast data.
●Real-time updates ended soon after midnight and were
resumed a few hours later. (That was NOT acceptable.)
Amendment:
●Re-think!
Solution - approach three
Decisions:
●Take advantage of 6*24 cycle (as opposed to 7*24)
●Switch dates on Saturdays only
  o the "Now" partition had to stretch from Saturday to Saturday
  o all other partitions had to stretch from one Saturday to another
●Re-process all time-consuming partitions on Saturday, after the switch of date
Solution - approach three cont'd
Actions:
●Create logic in Oracle database to do date calculations
"modulo week", i.e. based on Saturday. Logic implemented
as function.
●Rewrite SQL statements for cube partitions so that they
employ the Oracle function (as above) instead of current
date +/- given number of days.
●Reschedule the time-consuming updates so they run every 7th day.
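The "modulo week" date logic was implemented as an Oracle function; an equivalent sketch in Python (function names are illustrative) shows the idea of snapping partition boundaries to Saturdays instead of using current date +/- N days:

```python
import datetime as dt

def last_saturday(d):
    """Most recent Saturday on or before d (the 'modulo week' boundary).
    Python's weekday(): Monday=0 ... Saturday=5."""
    return d - dt.timedelta(days=(d.weekday() - 5) % 7)

def partition_bounds(d, weeks_back, weeks_len):
    """A partition stretching from one Saturday to another, relative to d."""
    hi = last_saturday(d) - dt.timedelta(weeks=weeks_back)
    lo = hi - dt.timedelta(weeks=weeks_len)
    return lo, hi

# A Wednesday maps to the preceding Saturday:
print(last_saturday(dt.date(2010, 6, 23)))  # 2010-06-19
```

Because every partition boundary comes from the same Saturday anchor, the boundaries only move once a week, so the expensive partitions stay valid for seven days at a time.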
Did it work?
Yes!
Malfunctions:
●None, really.
Lessons learned
●It is possible to build real-time OLAP cubes in MS
technology
●It is possible to make the partitions self-maintaining in
terms of partition boundaries
●The concept needs careful engineering, as there are pitfalls along the way.
Omitted details
Some details have been omitted:
●the quasi real-time updates are scheduled to occur every
2nd or 3rd minute
●scheduling is not exact, as the Super-job keeps track of
what is to be run and when and executes SSIS packages
based on "scheduled-to-run" state, their priority and a few
other criteria
●the source of data is not a proper star schema; rather, it is an emulation of facts and dimensions by means of data tables and views in Oracle.

More Related Content

What's hot

Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniquesLars Albertsson
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit
 
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupPresto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupWojciech Biela
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceDatabricks
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for ScyllaScyllaDB
 
Traveloka's data journey — Traveloka data meetup #2
Traveloka's data journey — Traveloka data meetup #2Traveloka's data journey — Traveloka data meetup #2
Traveloka's data journey — Traveloka data meetup #2Traveloka
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Streaming Analytics @ Uber
Streaming Analytics @ UberStreaming Analytics @ Uber
Streaming Analytics @ UberXiang Fu
 

What's hot (20)

Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupPresto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for Scylla
 
Traveloka's data journey — Traveloka data meetup #2
Traveloka's data journey — Traveloka data meetup #2Traveloka's data journey — Traveloka data meetup #2
Traveloka's data journey — Traveloka data meetup #2
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Streaming Analytics @ Uber
Streaming Analytics @ UberStreaming Analytics @ Uber
Streaming Analytics @ Uber
 

Viewers also liked

Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)
Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)
Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)Issac Buenrostro
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubesmister_zed
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastoreKishore Gopalakrishna
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016Carl Steinbach
 
Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013
Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013
Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013Cosmin Lehene
 
OLAP OnLine Analytical Processing
OLAP OnLine Analytical ProcessingOLAP OnLine Analytical Processing
OLAP OnLine Analytical ProcessingWalid Elbadawy
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecturehasanshan
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017Sudhir Tonse
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at UberSudhir Tonse
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEZalpa Rathod
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data AnalyticsAnkur Bansal
 

Viewers also liked (16)

Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)
 
Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)
Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)
Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubes
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastore
 
Pro_Tools_Tier_2
Pro_Tools_Tier_2Pro_Tools_Tier_2
Pro_Tools_Tier_2
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
 
Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013
Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013
Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013
 
OLAP OnLine Analytical Processing
OLAP OnLine Analytical ProcessingOLAP OnLine Analytical Processing
OLAP OnLine Analytical Processing
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecture
 
OLAP
OLAPOLAP
OLAP
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSE
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
Uber's Business Model
Uber's Business ModelUber's Business Model
Uber's Business Model
 

Similar to Quasi Real-Time OLAP Cubes Using Partitioning

HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010Cloudera, Inc.
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3LibbySchulze
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBencht_ivanov
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems researchVasia Kalavri
 
Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Hiral Patel
 
Choosing data warehouse considerations
Choosing data warehouse considerationsChoosing data warehouse considerations
Choosing data warehouse considerationsAseem Bansal
 
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...Neo4j
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)NerdWalletHQ
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedShubham Tagra
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Lucas Jellema
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedShubham Tagra
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Data Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather ForecastingData Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather ForecastingJames Arnold Faeldon
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Sql server tips from the field
Sql server tips from the fieldSql server tips from the field
Sql server tips from the fieldJoAnna Cheshire
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresJitendra Singh
 

Similar to Quasi Real-Time OLAP Cubes Using Partitioning (20)

HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems research
 
Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]
 
Choosing data warehouse considerations
Choosing data warehouse considerationsChoosing data warehouse considerations
Choosing data warehouse considerations
 
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...
The Protein Regulatory Networks of COVID-19 - A Knowledge Graph Created by El...
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BI
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Data Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather ForecastingData Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather Forecasting
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Sql server tips from the field
Sql server tips from the fieldSql server tips from the field
Sql server tips from the field
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
OOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with ParallelOOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with Parallel
 

Quasi Real-Time OLAP Cubes Using Partitioning

  • 1. Case study: Quasi real-time OLAP cubes by Ziemowit Jankowski Database Architect
  • 2. OLAP Cubes - what is it? ●Used to quickly analyze and retrieve data from different perspectives ●Numeric data ●Structured data: ocan be represented as numeric values (or sets thereof) accessed by a composite key oeach of the parts of the composite key belongs to a well-defined set of values ●Facts = numeric values ●Dimensions = parts of the composite key ●Source = usually a start or snowflake schema in a relational DB (other sources possible)
  • 3. OLAP Cubes - data sources Star schema Snowflake schema Production outcome Product Date District Sub- district Production outcome Product Date District Year Month Day of week
  • 4. OLAP Facts and dimensions ●Every "cell" in an OLAP cube contains numeric data a.k.a "measures". ●Every "cell" may contain more than one measure, e.g. forecast and outcome. ●Every "cell" has a unique combination of dimension values. District Product
  • 5. OLAP Cubes - operations ●Slice = choose values corresponding to ONE value on one or more dimensions ●Dice = choose values corresponding to one slice or a number of consecutive slices on more than 2 dimensions of the cube
  • 6. OLAP Cubes - operations (cont'd) ●Drill down/up = choose lower/higher level details. Used in context of hierarchical dimensions. ●Pivot = rotate the orientation of the data for reporting purposes ●Roll-up
  • 7. OLAP Cubes - refresh methods ●Incremental: opossible when cubes grow "outwards", i.e. no "scattered" changes in data oonly delta data need to be read orefresh may be fast if delta is small ●Full: opossible for all cubes, even when changes are "scattered" all over thedata oall data need to be re-read with every orefresh may take long time (hours) Time Cube data Updates Time Cube data New data
  • 8. The situation on hand ●Business operating on 24*6 basis (Sun-Fri) ●Events from production systems are aggregated into flows and production units ●Production figures may be adjusted manually long after production date ●Daily production figures are basis for daily forecasts with the simplified formula: forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm ●Adjustments in production figures will alter forecast figures ●Outcome and forecast should be stored in MS OLAP cubes as per software architecture demands ●The system should simplify comparisons between forecast and outcome figures
  • 9. Software
    ● Source of data:
      o relational database
      o Oracle 10g database
      o extensive use of PL/SQL in the database
    ● Destination of data:
      o OLAP cubes - MS SQL Server Analysis Services (versions 2005 and 2008)
    ● Other software:
      o MS SQL Server database
  • 10. QUESTION: Can we get almost real-time reports from MS OLAP cubes?
    ANSWER: YES! The answer lies in "cube partitioning".
  • 11. Cube partitioning - the basics
    ● Cube partitions may be updated independently
    ● Cube partitions must not overlap (otherwise duplicate values may occur)
    ● Time is a good dimension to partition on
  • 12. MS OLAP cube partitioning - details
    ● Every cube partition has its own query to define the data set fetched from the data source
    ● The SQL statements define the non-overlapping data sets
    (Diagram: a relational DB feeding a partitioned cube through four SQL queries, one per partition.)
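Non-overlapping data sets can be enforced by generating half-open date ranges, one per partition. A sketch (the table and column names are hypothetical; the real deployment used hand-written SQL per MS OLAP partition):

```python
def partition_queries(boundaries, table="FACT_PRODUCTION", date_col="PROD_DATE"):
    """Build one partition query per pair of adjacent, sorted date boundaries.
    Half-open intervals [lo, hi) guarantee the partitions never overlap."""
    queries = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {date_col} >= DATE '{lo}' AND {date_col} < DATE '{hi}'"
        )
    return queries

# Three partitions: history, last year, current.
qs = partition_queries(["2008-01-01", "2009-01-01", "2010-01-01", "2010-02-01"])
print(len(qs))  # 3
```

Every fact row falls into exactly one interval, so no value is counted twice when the partitions are merged by the cube.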
  • 13. MS OLAP cube partitioning - details
    (Diagram: the same relational DB and partitioned cube; the dimension queries fetch a small amount of data, while the fact query fetches a large amount of data.)
  • 14. How to partition? - theory
    ● Partitions with different lengths and different update frequencies:
      o current data = very small partition, very short update times, updated often
      o "not very current" data = a bit larger partition, longer update times, updated less often
      o historical data = large partition, long update times, updated seldom
    ● Operating 24x6 delivers the "seldom" window
  • 15. How to partition? - theory cont'd
    ● One cube for both the forecast measure and the outcome measure
    (Diagram: a timeline partitioned into History, Last year, Last month, Now and One year into the future.)
  • 16. Solution - approach one
    Decisions:
    ● Cubes partitioned on date boundaries
    ● MOLAP cubes (for better query performance)
    ● Use SSIS to populate the cubes
      o dimensions populated by incremental processing
      o facts populated by full processing
      o jobs for historical data must be run after midnight to compensate for the date change
    Actions:
    ● Cubes built
    ● SSIS deployed inside SQL Server (not the filesystem)
    ● SSIS set up as scheduled database jobs
  • 17. Did it work? No!
    Malfunctions:
    ● Simultaneous updates of cube partitions could lead to deadlocks
    ● Deadlocks left cube partitions in an unprocessed state
    Amendment:
    ● Cube partitions must not be updated simultaneously
  • 18. Solution - approach two
    Decisions:
    ● Cube processing must be ONE partition at a time
    ● Scheduling done by an SSIS "super package":
      o a SQL Server table contains approx. frequencies and package names
      o the "super package" executes SSIS packages as indicated by the table
    Actions:
    ● Scheduling table created
    ● "Super package" created to be self-modifying
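The "super package" idea can be sketched as a loop over a scheduling table that runs at most one package per pass, which serializes cube processing and avoids the deadlocks of approach one. The table layout and package names below are hypothetical; the real implementation is an SSIS package reading a SQL Server table:

```python
# Hypothetical scheduling table: package name, approx. frequency
# (minutes) and when the package last ran (minutes since start).
schedule = [
    {"package": "process_now_partition", "freq_min": 3,     "last_run": 0},
    {"package": "process_last_month",    "freq_min": 60,    "last_run": 0},
    {"package": "process_history",       "freq_min": 10080, "last_run": 0},
]

def pick_next(schedule, now_min):
    """Return the most overdue package that is due, or None.
    Running exactly ONE package per pass keeps partition processing
    strictly serialized."""
    due = [r for r in schedule if now_min - r["last_run"] >= r["freq_min"]]
    if not due:
        return None
    row = max(due, key=lambda r: now_min - r["last_run"] - r["freq_min"])
    row["last_run"] = now_min
    return row["package"]

print(pick_next(schedule, now_min=5))  # process_now_partition
```

A driver would call `pick_next` every minute or so and execute the returned SSIS package, so two partitions are never processed at the same time.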
  • 19. Did it work? Not really!
    Malfunctions:
    ● Historical data had to be updated after midnight, and real-time updates for the "Now" partition were postponed. This was done to avoid "gaps" in outcome data and "overlaps" in forecast data.
    ● Real-time updates ended soon after midnight and were resumed a few hours later. (That was NOT acceptable.)
    Amendment:
    ● Re-think!
  • 20. Solution - approach three
    Decisions:
    ● Take advantage of the 6*24 cycle (as opposed to 7*24)
    ● Switch dates on Saturdays only
      o the "Now" partition had to stretch from Saturday to Saturday
      o all other partitions had to stretch from one Saturday to another Saturday
    ● Re-process all time-consuming partitions on Saturday, after the switch of date
  • 21. Solution - approach three cont'd
    Actions:
    ● Create logic in the Oracle database to do date calculations "modulo week", i.e. based on Saturday. Logic implemented as a function.
    ● Rewrite the SQL statements for cube partitions so that they employ the Oracle function (as above) instead of the current date +/- a given number of days.
    ● Reschedule the time-consuming updates so they run every 7th day.
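The "modulo week" calculation can be sketched in Python (the original is a PL/SQL function; this sketch assumes partition boundaries snap to the most recent Saturday):

```python
import datetime

def last_saturday(d):
    """Most recent Saturday on or before d - the 'modulo week' boundary.
    Partition SQL uses this instead of current date +/- N days, so the
    boundaries only move once a week (on Saturdays)."""
    # datetime.weekday(): Monday=0 ... Saturday=5, Sunday=6.
    days_back = (d.weekday() - 5) % 7
    return d - datetime.timedelta(days=days_back)

# All dates within one Sun-Fri business week share the same boundary:
print(last_saturday(datetime.date(2010, 3, 17)))  # Wednesday -> 2010-03-13
print(last_saturday(datetime.date(2010, 3, 13)))  # Saturday  -> 2010-03-13
```

With every partition boundary defined this way, the "Now" partition automatically stretches from Saturday to Saturday and the partitions stay self-maintaining without editing the SQL each week.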
  • 23. Lessons learned
    ● It is possible to build quasi real-time OLAP cubes in MS technology
    ● It is possible to make the partitions self-maintaining in terms of partition boundaries
    ● The concept needs careful engineering, as there are pitfalls along the way.
  • 24. Omitted details
    Some details have been omitted:
    ● the quasi real-time updates are scheduled to occur every 2nd or 3rd minute
    ● scheduling is not exact: the super package keeps track of what is to be run and when, and executes SSIS packages based on their "scheduled-to-run" state, their priority and a few other criteria
    ● the source of data is not a proper star schema; it is rather an emulation of facts and dimensions by means of data tables and views in Oracle.