Performance Considerations in the Logical Data Warehouse
Mark Pritchard
Sales Engineering Director UK, Denodo
Challenging the Myths
Logical Data Warehouse Performance
What is a Logical Data Warehouse?
A logical data warehouse is a data system that follows the design ideas of a traditional EDW (star or snowflake schemas) but, in addition to one or more core DWs, also includes data from external sources. Its main objectives are improved decision making and/or cost reduction.
“Data Virtualization solutions will be much slower than a persisted approach via ETL.”
– C. Assumption, Acme Corp
1. There is a large amount of data moved through the network for each query
2. Network transfer is slow
…but is this really true?
Challenging the Myths of Virtual Performance
Not as much data is moved as you may think!
▪ Complex queries can be solved transferring moderate data volumes when the
right techniques are applied
▪ Operational queries
▪ Predicate delegation produces small result sets (see the sketch after this list)
▪ Logical Data Warehouse and Big Data
▪ Denodo uses characteristics of underlying star schemas to apply query rewriting rules
that maximize delegation to specialized sources (especially heavy GROUP BY) and
minimize data movement
▪ Current networks are almost as fast as reading from disk
▪ 10 Gb and 100 Gb Ethernet are a commodity
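To make predicate delegation concrete, here is a minimal sketch; the schema and values are hypothetical, not taken from the benchmark below.

```sql
-- Query issued against a Denodo virtual view (hypothetical schema):
SELECT order_id, total_amount
FROM   sales_facts
WHERE  customer_id = 42
  AND  order_date >= DATE '2024-01-01';

-- Denodo delegates the WHERE clause to the underlying source, so the
-- source scans and filters locally and only the few matching rows
-- travel over the network, not the whole fact table.
```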
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Denodo has done extensive testing using queries from the standard TPC-DS* benchmark and the following scenario:
• It compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.
[Diagram] Logical data warehouse, with the tables federated across sources: Customer Dim. (2 M rows), Sales Facts (290 M rows), Items Dim. (400 K rows) vs. physical data warehouse, with the same Sales Facts, Items Dim. and Customer Dim. tables replicated into a single MPP system.
* TPC-DS is the de facto industry-standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
| Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected) |
|---|---|---|---|---|
| Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down |
| Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down |
| Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement |
Performance Optimization
Logical Data Warehouse Performance
Performance and Optimizations in Denodo
Comparing optimizations in DV vs ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
▪ Uses relational logic
▪ Metadata is equivalent to that of a database
▪ Enables ad hoc querying
Key difference between ETL engines and DV:
▪ ETL engines are optimized for static bulk movements (fixed data flows)
▪ Data Virtualization is optimized for queries (a dynamic execution plan per query)
Denodo’s performance architecture resembles that of an RDBMS.
Performance and Optimizations in Denodo
Focused on 3 core concepts
Dynamic Multi-Source Query Execution Plans
▪ Leverages the processing power & architecture of the data sources
▪ Dynamic, to support ad hoc queries
▪ Uses statistics for cost-based query plans
Selective Materialization
▪ Intelligent caching of only the most relevant and most frequently used information
Optimized Resource Management
▪ Smart allocation of resources to handle high concurrency
▪ Throttling to control and mitigate impact on the sources
▪ Rule-based resource plans
Query Optimizer
Performance Optimization
How the Dynamic Query Optimizer Works
Step by Step
Metadata / Query Tree
• Maps query entities (tables, fields) to the actual metadata
• Retrieves execution capabilities and restrictions for the views involved in the query
Static Optimizer
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Query delegation
• Data movement query plans
Cost-Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
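As a hedged illustration of these four stages, consider how one hypothetical query might flow through them (the view names, source assignments and the hash-join choice are assumptions for the sake of the example):

```sql
-- User query against two virtual views:
SELECT c.country, SUM(s.amount) AS total_sales
FROM   v_sales s JOIN v_customer c ON c.customer_id = s.customer_id
GROUP  BY c.country;

-- 1. Metadata / Query Tree: v_sales maps to a Netezza table, v_customer to an
--    Oracle table; the capabilities of each source are looked up.
-- 2. Static Optimizer: recognizes the star-schema pattern and rewrites the plan
--    to pre-aggregate v_sales by customer_id inside Netezza.
-- 3. Cost-Based Optimizer: given the estimated row counts, picks e.g. a hash
--    join in the DV layer rather than per-row lookups against Oracle.
-- 4. Physical Execution Plan: emits Netezza SQL and Oracle SQL in their
--    respective dialects.
```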
How the Dynamic Query Optimizer Works
Key Optimizations for Logical Data Warehouse Scenarios
Automatic JOIN reordering
▪ Groups branches that go to the same source, to maximize query delegation and reduce processing in the DV layer (see the sketch below)
▪ End users don’t need to worry about the optimal “pairing” of the tables
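A minimal sketch of that grouping, with hypothetical table-to-source assignments:

```sql
-- As written, the join order interleaves two sources:
SELECT c.customer_name, i.brand, s.amount
FROM   customer_dim c                                    -- CRM database
JOIN   sales_facts  s ON s.customer_id = c.customer_id   -- DW
JOIN   items_dim    i ON i.item_id     = s.item_id;      -- DW

-- After reordering, the DW-only branch becomes one delegable unit:
--   (sales_facts JOIN items_dim)  -> pushed down to the DW
--   ... JOIN customer_dim         -> executed in the DV layer
```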
The Partial Aggregation push-down optimization is key in LDW scenarios. Based on PK-FK constraints, it pushes a partial aggregation (grouped by the key columns) down to the DW:
▪ Leverages the processing power of the DW, which is optimized for these aggregations
▪ Significantly reduces the data transferred through the network (e.g. from 1 B rows to 10 K); see the sketch below
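A sketch of that rewrite, using the benchmark scenario’s row counts and hypothetical column names:

```sql
-- Original query: total sales by customer country
SELECT c.country, SUM(s.amount) AS total_sales
FROM   sales_facts  s
JOIN   customer_dim c ON c.customer_id = s.customer_id
GROUP  BY c.country;

-- Rewritten: a partial aggregation on the join key is delegated to the DW...
--   SELECT customer_id, SUM(amount) AS partial_sales
--   FROM   sales_facts
--   GROUP  BY customer_id;     -- ~2 M rows cross the network, not 290 M
-- ...and Denodo joins this partial result with customer_dim and
-- re-aggregates by country in the DV layer.
```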
The Cost-Based Optimizer picks the right JOIN strategies based on estimates of data volumes, the existence of indexes, transfer rates, etc.
▪ Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for conventional databases, to account for the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.)
How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Automatic Data Movement
▪ Creation of temp tables in one of the systems to enable complete delegation
▪ Only considered as an option if the target source has the “data movement” option enabled
▪ Use of native bulk load APIs for better performance (see the sketch below)
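Conceptually, the technique looks like the following sketch (the temp-table handling is internal to Denodo; the names and DDL here are assumptions). It mirrors the benchmark query that compared sale price against list price:

```sql
-- Step 1: Denodo bulk-loads the small branch into the system holding the
-- large fact table (using the source's native bulk load API):
CREATE TEMPORARY TABLE tmp_items (item_id INT, list_price DECIMAL(10,2));

-- Step 2: with both sides now in the same system, the whole query can be
-- delegated in one shot:
SELECT s.item_id, SUM(s.amount) AS total_sales
FROM   sales_facts s
JOIN   tmp_items   t ON t.item_id = s.item_id
WHERE  s.sale_price < t.list_price
GROUP  BY s.item_id;
```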
Execution Alternatives
▪ If a view exists in more than one system, Denodo can decide at execution time which one to use
▪ The goal is to maximize query delegation, depending on the other tables involved in the query
How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Optimizations for Virtual Partitioning
Eliminates unnecessary queries and processing based on a pre-execution analysis of the views and the queries:
▪ Pruning of unnecessary JOIN branches
▪ Pruning of unnecessary UNION branches
▪ Push-down of JOINs under UNION views
▪ Automatic data movement for partition scenarios
A sketch of UNION-branch pruning follows below.
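A minimal sketch of UNION-branch pruning, assuming a hypothetical view partitioned by year across two sources:

```sql
-- Virtual partitioned view defined in the DV layer:
--   all_sales = SELECT * FROM sales_2023   -- source A
--               UNION ALL
--               SELECT * FROM sales_2024   -- source B

SELECT item_id, SUM(amount) AS total_sales
FROM   all_sales
WHERE  sale_year = 2024    -- pre-execution analysis proves the sales_2023
GROUP  BY item_id;         -- branch cannot match, so source A is never queried
```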
Caching
Performance Optimization
Caching
Real time vs. caching
Sometimes, real-time access & federation are not a good fit:
▪ Sources are slow (e.g. text files, cloud apps like Salesforce.com)
▪ A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
▪ Access is limited, or impact on the sources must be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
Caching
Overview
Denodo’s cache system is based on an external relational database:
▪ Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
▪ MPP (Teradata, Netezza, Vertica, Redshift, etc.)
▪ In-memory storage (Oracle TimesTen, SAP HANA)
Works at the view level
▪ Allows hybrid (real-time / cached) access within an execution tree
Cache control (population / maintenance); see the sketch below
▪ Manually – user-initiated at any time
▪ Time-based – using a TTL or the Denodo Scheduler
▪ Event-based – e.g. using JMS messages triggered in the DB
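For the manual case, population might look like the following. This assumes Denodo’s VQL CONTEXT options for cache preloading, and customer_summary is a hypothetical cached view; check the Denodo documentation for the exact parameter names in your version.

```sql
-- User-initiated cache population for a cached view (assumed VQL syntax):
SELECT *
FROM customer_summary
CONTEXT ('cache_preload' = 'true', 'cache_wait_for_load' = 'true');
```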
Thank you!
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.
#DenodoDataFest
