Performance is a key consideration for organizations looking to implement big data, logical data warehouse, and operational use cases. In this presentation, the technology expert demonstrates the performance aspects of using data virtualization to accelerate the delivery of fast data to end consumers.
This presentation is part of the Fast Data Strategy Conference; you can watch the video here: goo.gl/YMPhvE.
2. Agenda
1. Debunking the myths of virtual performance
2. Query Optimizer
3. Cache
4. Resource Management
5. Further Reading
3.
It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL, because:
1. A large amount of data is moved through the network for each query
2. Network transfer is slow
But is this really true?
4.
Debunking the myths of virtual performance
1. Complex queries can be solved transferring only moderate data volumes when the right techniques are applied:
   • Operational queries: predicate delegation produces small result sets (see the sketch below)
   • Logical Data Warehouse and Big Data: Denodo uses the characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to the specialized sources (especially heavy GROUP BY operations) and minimize data movement
2. Current networks are almost as fast as reading from disk: 10 Gigabit and 100 Gigabit Ethernet are commodities
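As a minimal sketch of predicate delegation (the view, table, and column names here are hypothetical), an operational query's filter travels with the delegated query, so only the matching rows ever cross the network:

-- Query received by the virtualization layer
-- (customer_orders is a hypothetical virtual view over the source table orders):
SELECT order_id, order_date, amount
FROM customer_orders
WHERE customer_id = 10042;

-- Query delegated to the underlying source: the WHERE clause is pushed down,
-- so the source returns only this customer's orders, not the whole table.
SELECT order_id, order_date, amount
FROM orders
WHERE customer_id = 10042;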
5.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
• Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the following scenario:
  - Customer Dimension: 2 M rows
  - Sales Facts: 290 M rows
  - Items Dimension: 400 K rows
• The baseline was set using the same queries with all data in a Netezza appliance
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
6.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse

| Query Description | Returned Rows | Time Netezza | Time Denodo (federating Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected) |
|---|---|---|---|---|
| Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down |
| Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down |
| Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement |
7.
Performance and optimizations in Denodo 6.0
Focused on 3 core concepts:
• Dynamic Multi-Source Query Execution Plans
  - Leverages the processing power and architecture of the data sources
  - Dynamic, to support ad hoc queries
  - Uses statistics for cost-based query plans
• Selective Materialization
  - Intelligent caching of only the most relevant and most often used information
• Optimized Resource Management
  - Smart allocation of resources to handle high concurrency
  - Throttling to control and mitigate the impact on sources
  - Resource plans based on rules
8.
Performance and optimizations in Denodo 6.0
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
• Uses relational logic
• Metadata is equivalent to that of a database
• Enables ad hoc querying
Key difference between ETL engines and DV:
• ETL engines are optimized for static bulk movements (fixed data flows)
• Data virtualization is optimized for queries (a dynamic execution plan per query)
Therefore, the performance architecture presented here resembles that of an RDBMS.
10.
How Dynamic Query Optimizer Works
Step by Step
1. Metadata / Query Tree
   • Maps query entities (tables, fields) to the actual metadata
   • Retrieves execution capabilities and restrictions for the views involved in the query
2. Static Optimizer
   • Query delegation
   • SQL rewriting rules: removal of redundant filters (see the sketch below), tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.
   • Data movement query plans
3. Cost-Based Optimizer
   • Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
4. Physical Execution Plan
   • Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
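As a minimal illustration of one of these rewriting rules, removal of redundant filters (the view and table names are hypothetical): when a virtual view is already restricted by its definition, a duplicate filter in the user query is dropped before delegation:

-- eu_customers is a hypothetical virtual view defined as:
--   SELECT * FROM customers WHERE region = 'EU'
SELECT name
FROM eu_customers
WHERE region = 'EU' AND status = 'ACTIVE';

-- Expanding the view definition would repeat the region filter twice;
-- the static optimizer keeps it only once in the delegated query:
SELECT name
FROM customers
WHERE region = 'EU' AND status = 'ACTIVE';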
11.
How Dynamic Query Optimizer Works
Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME.
Tables involved: Time Dimension, Fact table (sales), Product Dimension and Retailer Dimension, spread across two systems (EDW and MDM).

SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
12.
How Dynamic Query Optimizer Works
Example: Non-optimized
All three JOINs and the final GROUP BY (on product.name, retailer.name) execute in the virtualization layer, so the entire fact table crosses the network. The branch queries sent to the sources are:

SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount
FROM sales
(1,000,000,000 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

SELECT time.date, time.id
FROM time
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
(30 rows)

The intermediate JOIN result grouped in the DV layer is 10,000,000 rows.
13.
How Dynamic Query Optimizer Works
Step 1: Applies JOIN reordering to maximize delegation
The sales–time JOIN is grouped and delegated to their common source, so the date filter is applied before the data leaves it:

SELECT sales.retailer_fk, sales.product_fk, sales.amount
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
(100,000,000 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

The remaining JOINs (intermediate result: 10,000,000 rows) and the GROUP BY on product.name, retailer.name still run in the DV layer.
14.
How Dynamic Query Optimizer Works
Step 2: Partial aggregation push-down
Since the JOINs are on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, the optimizer applies the partial aggregation push-down optimization:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
GROUP BY sales.retailer_fk, sales.product_fk
(10,000 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

The JOINs (intermediate result: 1,000 rows) and the final GROUP BY on product.name, retailer.name run in the DV layer.
15.
How Dynamic Query Optimizer Works
Step 3: Selects the right JOIN strategy based on cost estimates for the data volumes
The small product branch is combined with the aggregated sales branch using a NESTED JOIN, which pushes the selected product ids into the delegated query; the retailer branch is combined using a HASH JOIN:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
  AND sales.product_fk IN (1, 2, …)
GROUP BY sales.retailer_fk, sales.product_fk
(1,000 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

The final GROUP BY on product.name, retailer.name produces 1,000 rows.
16.
How Dynamic Query Optimizer Works
Summary
• Automatic JOIN reordering groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer. End users don't need to worry about the optimal "pairing" of the tables.
• The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (for the PKs) down to the DW:
  - Leverages the processing power of the DW, which is optimized for these aggregations
  - Significantly reduces the data transferred through the network (from 1 billion rows to 10,000 in this example)
• The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc. Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.).
17.
How Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
• Pruning of unnecessary JOIN branches (based on one-to-many associations) when the attributes of the 1-side are not projected
  - Relevant for horizontal partitioning and "fat" semantic models, when queries do not need attributes from all the tables
  - Unnecessary tables are removed from the query (even for single-source models)
• Pruning of UNION branches based on incompatible filters
  - Enables detection of unnecessary UNION branches in vertical partitioning scenarios
• Automatic data movement
  - Creates temp tables in one of the systems to enable complete delegation of a federated branch
  - The target source needs to have the "data movement" option enabled for this option to be taken into account
A sketch of the two pruning rules follows below.
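A minimal sketch of the two pruning rules, using hypothetical tables (sales has a foreign key to customer, and sales_2015 / sales_2016 are yearly partitions unioned into one virtual view):

-- JOIN branch pruning: no customer attributes are projected, and every
-- sale matches exactly one customer (one-to-many association), so the
-- customer branch can be removed without changing the result:
SELECT s.product_id, SUM(s.amount)
FROM sales s JOIN customer c ON s.customer_id = c.id
GROUP BY s.product_id;
-- is rewritten as:
SELECT s.product_id, SUM(s.amount)
FROM sales s
GROUP BY s.product_id;

-- UNION branch pruning: the filter is incompatible with the 2015 branch,
-- so that branch is never queried:
SELECT *
FROM (SELECT * FROM sales_2015
      UNION ALL
      SELECT * FROM sales_2016) all_sales
WHERE sale_date >= DATE '2016-01-01';
-- is rewritten as:
SELECT * FROM sales_2016
WHERE sale_date >= DATE '2016-01-01';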
19.
Caching
Real time vs. caching
Sometimes real-time access and federation are not a good fit:
• Sources are slow (e.g. text files, cloud apps like Salesforce.com)
• A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
• Access is limited, or the impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
20.
Caching
Overview
Denodo's cache system is based on an external relational database:
• Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
• MPP (Teradata, Netezza, Vertica, Redshift, etc.)
• In-memory storage (Oracle TimesTen, SAP HANA)
It works at the view level and allows hybrid access (real-time / cached) within a single execution tree.
Cache control (population / maintenance):
• Manually – user initiated, at any time
• Time based – using the TTL or the Denodo Scheduler
• Event based – e.g. using JMS messages triggered in the DB
21.
Caching
Caching options
Denodo offers two different types of cache:
• Partial:
  - Query-by-query cache
  - Useful for caching only the most commonly requested data
  - Better suited to representing the capabilities of non-relational sources, like web services or APIs with input parameters
• Full:
  - Similar to the concept of a materialized view (see the sketch below)
  - Incrementally updateable at row level, to avoid unnecessary full refresh loads
  - Offers full push-down capabilities to the source, including GROUP BY and JOIN operations
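Conceptually, a row-level incremental refresh of a full cache behaves like a standard SQL MERGE of the rows changed in the source into the cache table. This is only a sketch with hypothetical table and column names, not Denodo's actual implementation:

MERGE INTO cached_orders c
USING (
  -- only the rows changed in the source since the last refresh
  SELECT order_id, amount, last_modified
  FROM src_orders
  WHERE last_modified > TIMESTAMP '2016-07-14 01:00:00'
) d
ON (c.order_id = d.order_id)
WHEN MATCHED THEN
  UPDATE SET c.amount = d.amount, c.last_modified = d.last_modified
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, last_modified)
  VALUES (d.order_id, d.amount, d.last_modified);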
22.
Hybrid Performance for SaaS sources
Incremental Queries (available July 2016)
Merge cached data and fresh data to provide fully up-to-date results with minimum latency (see the sketch below):
1. Salesforce 'Leads' data is cached in VDP at 1:00 AM
2. A query needing Leads data arrives at 11:00 AM
3. Only new or changed leads are retrieved through the WAN
4. The response is up-to-date, but the query is much faster
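A minimal sketch of what the merged result looks like at query time (the table and column names are hypothetical; Denodo performs this merge transparently): cached rows that are still current are combined with the rows changed or added in the source since the 1:00 AM load:

-- cached rows that have no fresher version in the source...
SELECT lead_id, name, status
FROM cached_leads
WHERE lead_id NOT IN (
  SELECT lead_id FROM salesforce_leads
  WHERE last_modified >= TIMESTAMP '2016-07-14 01:00:00'
)
UNION ALL
-- ...plus the fresh rows (changed or added) retrieved through the WAN
SELECT lead_id, name, status
FROM salesforce_leads
WHERE last_modified >= TIMESTAMP '2016-07-14 01:00:00';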
24.
Resource Management
Advanced memory management:
• Dynamic data buffers to control federation of sources with different data retrieval speeds, which guarantees a low memory footprint
• All operations are memory-constrained to prevent a single query from monopolizing resources; the constraints are adjustable
• Swapping data to disk to handle large data sets without overloading memory
• On-the-fly modification of execution plans to prevent exceeding memory thresholds
Server throttling mechanisms:
• Control settings to limit concurrency (max queries, max threads, …)
• Waiting queues for inbound connections
• Connection pools for data sources
25.
Resource Management
Enterprise Resource Manager
Applies resource restrictions based on a set of rules:
• Rules classify sessions into groups (e.g. by user, role, application, source IP, …)
  - E.g. sessions from the application 'single customer view' are assigned to a group called 'high priority transactional'
• Restrictions are applied to each group: change priority, concurrency settings, max timeouts, etc.
27.
Further Reading
Also check the following articles written by our CTO Alberto Pan on our blog:
• Myths in data virtualization performance
  http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
• Performance of Data Virtualization in Logical Data Warehouse scenarios
  http://www.datavirtualizationblog.com/performance-data-virtualization-logical-data-warehouse-scenarios/
• Physical vs Logical Data Warehouse: the numbers
  http://www.datavirtualizationblog.com/physical-logical-data-warehouse-performance-numbers/