Performance is a key consideration for organizations looking to implement big data, logical data warehouse, and operational use cases. In this presentation, the technology expert demonstrates the performance aspects of using data virtualization to accelerate the delivery of fast data to end consumers.
This presentation is part of the Fast Data Strategy Conference; you can watch the video here: goo.gl/YMPhvE.
2. Agenda
1. Debunking the myths of virtual performance
2. Query Optimizer
3. Cache
4. Resource Management
5. Further Reading
3.
It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL, because:
1. A large amount of data is moved through the network for each query
2. Network transfer is slow
But is this really true?
4.
Debunking the myths of virtual performance
1. Complex queries can be solved transferring only moderate data volumes when the right techniques are applied:
   • Operational queries: predicate delegation produces small result sets (see the sketch below)
   • Logical Data Warehouse and Big Data: Denodo uses the characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to the specialized sources (especially heavy GROUP BY operations) and minimize data movement
2. Current networks are almost as fast as reading from disk: 10 Gigabit and 100 Gigabit Ethernet are commodities
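As a minimal sketch of predicate delegation (the view, table, and column names here are hypothetical), an operational query's filter travels with the delegated query, so only the matching rows ever cross the network:

-- Query received by the virtualization layer
-- (customer_orders is a hypothetical virtual view over the source table orders):
SELECT order_id, order_date, amount
FROM customer_orders
WHERE customer_id = 10042;

-- Query delegated to the underlying source: the WHERE clause is pushed down,
-- so the source returns only this customer's orders, not the whole table.
SELECT order_id, order_date, amount
FROM orders
WHERE customer_id = 10042;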
5.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
• Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the following scenario:
  - Customer Dimension: 2 M rows
  - Sales Facts: 290 M rows
  - Items Dimension: 400 K rows
• The baseline was set using the same queries with all data in a Netezza appliance
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
6.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse

| Query Description | Returned Rows | Time Netezza | Time Denodo (federating Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected) |
|---|---|---|---|---|
| Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down |
| Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down |
| Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement |
7.
Performance and optimizations in Denodo 6.0
Focused on 3 core concepts:
• Dynamic Multi-Source Query Execution Plans
  - Leverages the processing power and architecture of the data sources
  - Dynamic, to support ad hoc queries
  - Uses statistics for cost-based query plans
• Selective Materialization
  - Intelligent caching of only the most relevant and most often used information
• Optimized Resource Management
  - Smart allocation of resources to handle high concurrency
  - Throttling to control and mitigate the impact on sources
  - Resource plans based on rules
8.
Performance and optimizations in Denodo 6.0
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
• Uses relational logic
• Metadata is equivalent to that of a database
• Enables ad hoc querying
Key difference between ETL engines and DV:
• ETL engines are optimized for static bulk movements (fixed data flows)
• Data virtualization is optimized for queries (a dynamic execution plan per query)
Therefore, the performance architecture presented here resembles that of an RDBMS.
10.
How Dynamic Query Optimizer Works
Step by Step
1. Metadata / Query Tree
   • Maps query entities (tables, fields) to the actual metadata
   • Retrieves execution capabilities and restrictions for the views involved in the query
2. Static Optimizer
   • Query delegation
   • SQL rewriting rules: removal of redundant filters (see the sketch below), tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.
   • Data movement query plans
3. Cost-Based Optimizer
   • Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
4. Physical Execution Plan
   • Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
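As a minimal illustration of one of these rewriting rules, removal of redundant filters (the view and table names are hypothetical): when a virtual view is already restricted by its definition, a duplicate filter in the user query is dropped before delegation:

-- eu_customers is a hypothetical virtual view defined as:
--   SELECT * FROM customers WHERE region = 'EU'
SELECT name
FROM eu_customers
WHERE region = 'EU' AND status = 'ACTIVE';

-- Expanding the view definition would repeat the region filter twice;
-- the static optimizer keeps it only once in the delegated query:
SELECT name
FROM customers
WHERE region = 'EU' AND status = 'ACTIVE';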
11.
How Dynamic Query Optimizer Works
Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME.
Tables involved: Time Dimension, Fact table (sales), Product Dimension and Retailer Dimension, spread across two systems (EDW and MDM).

SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
12.
How Dynamic Query Optimizer Works
Example: Non-optimized
All three JOINs and the final GROUP BY (on product.name, retailer.name) execute in the virtualization layer, so the entire fact table crosses the network. The branch queries sent to the sources are:

SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount
FROM sales
(1,000,000,000 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

SELECT time.date, time.id
FROM time
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
(30 rows)

The intermediate JOIN result grouped in the DV layer is 10,000,000 rows.
13.
How Dynamic Query Optimizer Works
Step 1: Applies JOIN reordering to maximize delegation
The sales–time JOIN is grouped and delegated to their common source, so the date filter is applied before the data leaves it:

SELECT sales.retailer_fk, sales.product_fk, sales.amount
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
(100,000,000 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

The remaining JOINs (intermediate result: 10,000,000 rows) and the GROUP BY on product.name, retailer.name still run in the DV layer.
14.
How Dynamic Query Optimizer Works
Step 2: Partial aggregation push-down
Since the JOINs are on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, the optimizer applies the partial aggregation push-down optimization:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
GROUP BY sales.retailer_fk, sales.product_fk
(10,000 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

The JOINs (intermediate result: 1,000 rows) and the final GROUP BY on product.name, retailer.name run in the DV layer.
15.
How Dynamic Query Optimizer Works
Step 3: Selects the right JOIN strategy based on cost estimates for the data volumes
The small product branch is combined with the aggregated sales branch using a NESTED JOIN, which pushes the selected product ids into the delegated query; the retailer branch is combined using a HASH JOIN:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
  AND sales.product_fk IN (1, 2, …)
GROUP BY sales.retailer_fk, sales.product_fk
(1,000 rows)

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
(10 rows)

SELECT retailer.name, retailer.id
FROM retailer
(100 rows)

The final GROUP BY on product.name, retailer.name produces 1,000 rows.
16.
How Dynamic Query Optimizer Works
Summary
• Automatic JOIN reordering groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer. End users don't need to worry about the optimal "pairing" of the tables.
• The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (for the PKs) down to the DW:
  - Leverages the processing power of the DW, which is optimized for these aggregations
  - Significantly reduces the data transferred through the network (from 1 billion rows to 10,000 in this example)
• The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc. Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.).
17.
How Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
• Pruning of unnecessary JOIN branches (based on one-to-many associations) when the attributes of the 1-side are not projected
  - Relevant for horizontal partitioning and "fat" semantic models, when queries do not need attributes from all the tables
  - Unnecessary tables are removed from the query (even for single-source models)
• Pruning of UNION branches based on incompatible filters
  - Enables detection of unnecessary UNION branches in vertical partitioning scenarios
• Automatic data movement
  - Creates temp tables in one of the systems to enable complete delegation of a federated branch
  - The target source needs to have the "data movement" option enabled for this option to be taken into account
A sketch of the two pruning rules follows below.
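A minimal sketch of the two pruning rules, using hypothetical tables (sales has a foreign key to customer, and sales_2015 / sales_2016 are yearly partitions unioned into one virtual view):

-- JOIN branch pruning: no customer attributes are projected, and every
-- sale matches exactly one customer (one-to-many association), so the
-- customer branch can be removed without changing the result:
SELECT s.product_id, SUM(s.amount)
FROM sales s JOIN customer c ON s.customer_id = c.id
GROUP BY s.product_id;
-- is rewritten as:
SELECT s.product_id, SUM(s.amount)
FROM sales s
GROUP BY s.product_id;

-- UNION branch pruning: the filter is incompatible with the 2015 branch,
-- so that branch is never queried:
SELECT *
FROM (SELECT * FROM sales_2015
      UNION ALL
      SELECT * FROM sales_2016) all_sales
WHERE sale_date >= DATE '2016-01-01';
-- is rewritten as:
SELECT * FROM sales_2016
WHERE sale_date >= DATE '2016-01-01';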
19.
Caching
Real time vs. caching
Sometimes real-time access and federation are not a good fit:
• Sources are slow (e.g. text files, cloud apps like Salesforce.com)
• A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
• Access is limited, or the impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
20.
Caching
Overview
Denodo's cache system is based on an external relational database:
• Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
• MPP (Teradata, Netezza, Vertica, Redshift, etc.)
• In-memory storage (Oracle TimesTen, SAP HANA)
It works at the view level and allows hybrid access (real-time / cached) within a single execution tree.
Cache control (population / maintenance):
• Manually – user initiated, at any time
• Time based – using the TTL or the Denodo Scheduler
• Event based – e.g. using JMS messages triggered in the DB
21.
Caching
Caching options
Denodo offers two different types of cache:
• Partial:
  - Query-by-query cache
  - Useful for caching only the most commonly requested data
  - Better suited to representing the capabilities of non-relational sources, like web services or APIs with input parameters
• Full:
  - Similar to the concept of a materialized view (see the sketch below)
  - Incrementally updateable at row level, to avoid unnecessary full refresh loads
  - Offers full push-down capabilities to the source, including GROUP BY and JOIN operations
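Conceptually, a row-level incremental refresh of a full cache behaves like a standard SQL MERGE of the rows changed in the source into the cache table. This is only a sketch with hypothetical table and column names, not Denodo's actual implementation:

MERGE INTO cached_orders c
USING (
  -- only the rows changed in the source since the last refresh
  SELECT order_id, amount, last_modified
  FROM src_orders
  WHERE last_modified > TIMESTAMP '2016-07-14 01:00:00'
) d
ON (c.order_id = d.order_id)
WHEN MATCHED THEN
  UPDATE SET c.amount = d.amount, c.last_modified = d.last_modified
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, last_modified)
  VALUES (d.order_id, d.amount, d.last_modified);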
22.
Hybrid Performance for SaaS sources
Incremental Queries (available July 2016)
Merge cached data and fresh data to provide fully up-to-date results with minimum latency (see the sketch below):
1. Salesforce 'Leads' data is cached in VDP at 1:00 AM
2. A query needing Leads data arrives at 11:00 AM
3. Only new or changed leads are retrieved through the WAN
4. The response is up-to-date, but the query is much faster
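A minimal sketch of what the merged result looks like at query time (the table and column names are hypothetical; Denodo performs this merge transparently): cached rows that are still current are combined with the rows changed or added in the source since the 1:00 AM load:

-- cached rows that have no fresher version in the source...
SELECT lead_id, name, status
FROM cached_leads
WHERE lead_id NOT IN (
  SELECT lead_id FROM salesforce_leads
  WHERE last_modified >= TIMESTAMP '2016-07-14 01:00:00'
)
UNION ALL
-- ...plus the fresh rows (changed or added) retrieved through the WAN
SELECT lead_id, name, status
FROM salesforce_leads
WHERE last_modified >= TIMESTAMP '2016-07-14 01:00:00';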
24.
Resource Management
Advanced memory management:
• Dynamic data buffers to control federation of sources with different data retrieval speeds, which guarantees a low memory footprint
• All operations are memory-constrained to prevent a single query from monopolizing resources; the constraints are adjustable
• Swapping data to disk to handle large data sets without overloading memory
• On-the-fly modification of execution plans to prevent exceeding memory thresholds
Server throttling mechanisms:
• Control settings to limit concurrency (max queries, max threads, …)
• Waiting queues for inbound connections
• Connection pools for data sources
25.
Resource Management
Enterprise Resource Manager
Applies resource restrictions based on a set of rules:
• Rules classify sessions into groups (e.g. by user, role, application, source IP, …)
  - E.g. sessions from the application 'single customer view' are assigned to a group called 'high priority transactional'
• Restrictions are applied to each group: change priority, concurrency settings, max timeouts, etc.
27.
Further Reading
Also check the following articles written by our CTO Alberto Pan on our blog:
• Myths in data virtualization performance
  http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
• Performance of Data Virtualization in Logical Data Warehouse scenarios
  http://www.datavirtualizationblog.com/performance-data-virtualization-logical-data-warehouse-scenarios/
• Physical vs Logical Data Warehouse: the numbers
  http://www.datavirtualizationblog.com/physical-logical-data-warehouse-performance-numbers/