Five In-depth Technology and Architecture Sessions
on Data Virtualization
Session 2: Performance
Today’s Speaker
■ Paul Moxon
Senior Director, Product Management
Architect-to-Architect Series
■ Series of five webinars over next 2 months
■ Deeper look into Denodo Platform
■ Architectural Overview
■ Performance (today’s session)
■ Scalability
■ Data Discovery and Governance
■ Security
Denodo Express
■ Denodo Express
■ Free to Download
■ Fully functioning Data Virtualization Platform
■ Single user, supports common data sources
■ Many of the same capabilities of Denodo
Platform
■ Performance, Data Discovery, Governance,
internal Security, Publishing, …
Performance – Architecture Modules
Performance – Optimizer, etc.
■ Optimizer
■ The Optimizer applies state-of-the-art
optimization techniques to relational and non-
relational sources.
■ Query Plan Generator
■ The Plan Generator is in charge of generating
possible execution plans for the query and
selecting the optimum one.
■ Execution Engine
■ Responsible for executing the selected query
plan, executing the necessary sub-queries on
the sources (or collecting data from cache as
appropriate) and integrating the results to
generate the global response.
Performance Optimization
■ Advanced Query Optimization
■ Cost and Source Constraint Based Query Plans
■ Query Delegation
■ Automatic Query Rewriting
■ Join Optimizations
■ Data Movement
■ Asynchronous Multi-threaded Processing
■ Server Throttling Mechanisms
■ Scalability
■ Caching
■ Multiple configuration modes – full or partial
Static vs. Dynamic Optimization
■ Static optimization
■ Takes place before query is executed
■ Rewrite query in more optimal way
■ Push-down delegation
■ Optimize query by – where possible – pushing down
sub-trees to underlying data source
■ Delegate functions to underlying data source
■ Dynamic optimization
■ Use statistics and indices to estimate costs of
alternative execution plans
■ Select Join methods and Join ordering
Cost-based Optimization
■ Objective – select best execution method
for each operation
■ Estimate query costs based on:
■ View statistics
■ No. of rows, row size, for each field: max value, min
value, no. of different values, …
■ View indices
■ Available indices, type of indices (clustered, hash, …)
■ Data source I/O information
■ Block size, blocks/read operation, data transfer rate, …
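A minimal sketch of how the statistics listed above could feed a cost estimate. The class names, the formula, and the uniform-distribution assumption are all hypothetical illustrations, not the Denodo cost model.

```python
import math
from dataclasses import dataclass


@dataclass
class ViewStats:
    rows: int        # number of rows in the view
    row_size: int    # average row size in bytes
    distinct: dict   # field name -> number of different values


@dataclass
class SourceIO:
    block_size: int        # bytes per block
    blocks_per_read: int   # blocks fetched per read operation
    transfer_rate: float   # bytes per second from source to the DV layer


def equality_selectivity(stats, field):
    # Assumption: values are uniformly distributed over the distinct values.
    return 1.0 / stats.distinct[field]


def scan_cost(stats, io, field=None):
    """Return (read_ops, transfer_seconds) for a full scan of the view,
    optionally filtered at the source by an equality condition on `field`.
    Assumes no index is used, so the filter only reduces the data sent back."""
    total_bytes = stats.rows * stats.row_size
    read_ops = math.ceil(total_bytes / (io.block_size * io.blocks_per_read))
    selectivity = equality_selectivity(stats, field) if field else 1.0
    transfer_seconds = total_bytes * selectivity / io.transfer_rate
    return read_ops, transfer_seconds


sales = ViewStats(rows=1_000_000, row_size=100, distinct={"region": 4})
io = SourceIO(block_size=8192, blocks_per_read=8, transfer_rate=10_000_000.0)

full = scan_cost(sales, io)                      # whole view shipped to the DV layer
filtered = scan_cost(sales, io, field="region")  # filter pushed to the source
```

With these made-up numbers the filtered scan reads the same blocks but transfers a quarter of the data, which is the kind of difference a cost-based optimizer weighs between plans.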
Source Constraint Optimization
■ Denodo Platform optimization has to work
across multiple diverse data source types
■ Not just relational databases
■ Not all data sources have same capabilities
■ Recognize and optimize for constraints in
underlying data sources
■ e.g. MySQL can return rows already ordered for a Merge join, but a
delimited file cannot
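The MySQL versus delimited-file example can be sketched as a capability lookup. The capability table and function name below are hypothetical; they only illustrate the idea of planning around what each source can do.

```python
# Hypothetical capability table: which operations each source type can run natively.
CAPABILITIES = {
    "mysql": {"order_by": True, "join": True},
    "delimited_file": {"order_by": False, "join": False},
}


def merge_join_inputs(left_type, right_type):
    """A merge join needs both inputs sorted on the join key.
    Decide, per input, whether the sort can be delegated to the source."""
    steps = []
    for side, source_type in (("left", left_type), ("right", right_type)):
        if CAPABILITIES[source_type]["order_by"]:
            steps.append(f"{side}: delegate ORDER BY to {source_type}")
        else:
            steps.append(f"{side}: sort in the DV layer after reading {source_type}")
    return steps


plan = merge_join_inputs("mysql", "delimited_file")
```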
Statistics Gathering
Query Delegation
■ Objective – Push the processing to the data
■ Utilize power and optimizations of underlying
data sources
■ Especially relational databases and data warehouses
■ Minimize expensive data movement
■ Delegation mechanisms
■ Vendor specific SQL dialect
■ Function delegation
■ Configurable by data source
■ Delegate SQL operations
■ e.g. Join, Union, Group By, Order By, etc.
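To illustrate why delegation reduces data movement, the sketch below uses an in-memory SQLite database as a stand-in for an underlying source. The table and data are invented; the point is only the number of rows that cross from the source to the virtualization layer.

```python
import sqlite3
from collections import defaultdict

# Stand-in for an underlying relational source (hypothetical data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("NA", 10), ("NA", 20), ("EMEA", 5), ("EMEA", 15), ("EMEA", 30)],
)


def without_delegation(conn):
    # Fetch every row, then aggregate outside the source.
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals), len(rows)


def with_delegation(conn):
    # Delegate the GROUP BY: the source returns one row per region.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


local_totals, rows_moved_local = without_delegation(source)
pushed_totals, rows_moved_pushed = with_delegation(source)
```

Both functions return the same totals; the delegated version moves two rows instead of five.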
Automatic Query Rewriting
■ Objective – Rewrite query in a more optimal
manner before the query is executed
■ Static optimization technique
■ Typical optimizations:
■ Simplify partitioned unions
■ Remove redundant sub-views
■ Transform outer joins to inner joins
■ Static join reordering to maximize delegation
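Simplify Partitioned Unions
Select * from Sales_Product where region='NA'
[Diagram: query tree for Sales_Product. A join (|><|) sits over two unions (U), one of Sales_NA and Sales_EMEA, the other of Product_NA and Product_EMEA. The NA tables are in the North America source and the EMEA tables in the EMEA source; each branch has a selection (S), region='NA' or region='EMEA'.]
Join cannot be delegated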
Simplify Partitioned Unions (Cont’d)
Select * from Sales_Product where region='NA'
[Diagram: the same query tree after simplification. Only Sales_NA and Product_NA remain, both in the North America source, each with the selection (S) region='NA'.]
Join can be delegated
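The simplification on these two slides can be sketched as branch pruning. The metadata layout and function names are hypothetical; the table and source names follow the diagram.

```python
# Each union branch records the partition value it holds and where it lives.
UNION_VIEWS = {
    "Sales": [
        {"table": "Sales_NA", "region": "NA", "source": "North America"},
        {"table": "Sales_EMEA", "region": "EMEA", "source": "EMEA"},
    ],
    "Product": [
        {"table": "Product_NA", "region": "NA", "source": "North America"},
        {"table": "Product_EMEA", "region": "EMEA", "source": "EMEA"},
    ],
}


def surviving_branches(view, region=None):
    """Drop union branches whose partition cannot match the region filter."""
    return [
        b for b in UNION_VIEWS[view]
        if region is None or b["region"] == region
    ]


def join_can_be_delegated(region=None):
    """The join can be pushed down only if every remaining branch
    of both unions lives in the same data source."""
    branches = surviving_branches("Sales", region) + surviving_branches("Product", region)
    return len({b["source"] for b in branches}) == 1


before = join_can_be_delegated()          # no filter: both sources involved
after = join_can_be_delegated("NA")       # region='NA': only North America left
```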
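Transform Outer Joins to Inner Joins
The condition a.iinc_id = c.pinc_id means c.pinc_id cannot be null
[Diagram: query tree over a (Internet_inc in DS1), b (internet_inc in DS2) and c (phone_inc in DS3), combining a join (|><|), a left outer join and a selection (S) with the condition a.iinc_id = c.pinc_id.]
The left outer join is equivalent to an inner join (|><|)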
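The equivalence can be checked on a small example. This sketch uses SQLite with invented rows and a hypothetical `cust` join column; only the table names and the iinc_id/pinc_id columns come from the slide. Rows that the left outer join pads with NULL are rejected by the filter, so both queries return the same result.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE internet_inc (iinc_id INTEGER, cust TEXT);
CREATE TABLE phone_inc (pinc_id INTEGER, cust TEXT);
INSERT INTO internet_inc VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO phone_inc VALUES (1, 'ann'), (9, 'bob');
""")

QUERY = """
SELECT a.iinc_id, c.pinc_id
FROM internet_inc a {join} phone_inc c ON a.cust = c.cust
WHERE a.iinc_id = c.pinc_id
ORDER BY a.iinc_id
"""

# 'cy' has no phone incident, so the outer join yields (3, NULL);
# NULL = 3 is not true, so the WHERE clause removes that row again.
outer_rows = db.execute(QUERY.format(join="LEFT OUTER JOIN")).fetchall()
inner_rows = db.execute(QUERY.format(join="INNER JOIN")).fetchall()
```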
Join Optimizations
■ Multiple Join options:
■ Merge
■ Nested
■ Nested Parallel
■ Hash
■ Optimizer automatically selects based on
statistics and source capabilities
■ e.g. when joining two large datasets that both come from
databases, Merge Join is preferred
■ e.g. if one dataset is significantly larger than the other, use
Nested Join
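A rough sketch of the selection rule described above. The size ratio threshold and the fallback to a hash join are assumptions made for illustration; the real optimizer bases its choice on statistics and source capabilities.

```python
def choose_join(left_rows, right_rows, both_sortable, size_ratio=100):
    """Pick a join method from estimated row counts.

    size_ratio is a hypothetical threshold for 'significantly larger'.
    both_sortable says whether both sources can return rows ordered
    on the join key (a source capability)."""
    small, large = sorted((left_rows, right_rows))
    if large >= size_ratio * max(small, 1):
        # One side is tiny: fetch it first, then query the large side
        # only for the matching keys.
        return "nested"
    if both_sortable:
        # Two large, ordered inputs can be merged in a single pass.
        return "merge"
    # Assumed fallback when the inputs cannot be delivered sorted.
    return "hash"


two_large_tables = choose_join(1_000_000, 2_000_000, both_sortable=True)
one_tiny_table = choose_join(500, 5_000_000, both_sortable=True)
unsortable = choose_join(1_000_000, 2_000_000, both_sortable=False)
```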
Join Optimizations (Cont’d)
■ You can override the optimizer
Data Movement
■ Typically used when one dataset is significantly
smaller and aggregations are performed on the joined
data
1. Execute query in DS1
and fetch its data
2. Create a temporary table in DS2
and insert data from step 1
3. When step 2 is completed, execute
the JOIN in DS2 and return the results
to the DV layer
DS1
DS2
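The three steps can be sketched with two in-memory SQLite databases standing in for DS1 and DS2. Table names and data are invented for the example.

```python
import sqlite3

ds1 = sqlite3.connect(":memory:")  # holds the small dataset
ds2 = sqlite3.connect(":memory:")  # holds the large dataset

ds1.executescript("""
CREATE TABLE product (id INTEGER, category TEXT);
INSERT INTO product VALUES (1, 'books'), (2, 'games');
""")
ds2.executescript("""
CREATE TABLE sales (product_id INTEGER, amount INTEGER);
INSERT INTO sales VALUES (1, 10), (1, 15), (2, 7), (2, 3), (2, 5);
""")


def join_with_data_movement(small_src, large_src):
    # Step 1: execute the query in DS1 and fetch its data.
    rows = small_src.execute("SELECT id, category FROM product").fetchall()

    # Step 2: create a temporary table in DS2 and insert the data from step 1.
    large_src.execute("CREATE TEMP TABLE tmp_product (id INTEGER, category TEXT)")
    large_src.executemany("INSERT INTO tmp_product VALUES (?, ?)", rows)

    # Step 3: run the JOIN and the aggregation inside DS2, return only the result.
    result = large_src.execute("""
        SELECT p.category, SUM(s.amount)
        FROM sales s JOIN tmp_product p ON s.product_id = p.id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

    large_src.execute("DROP TABLE tmp_product")  # clean up the temporary table
    return result


totals = join_with_data_movement(ds1, ds2)
```

Only the two product rows travel to DS2 and only two aggregated rows travel back, instead of every sales row being pulled into the DV layer.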
Query Plans
■ Optimizer calculates cost of multiple plans and
selects ‘best’ plan
■ Cost estimates:
1. Traverse query tree top-down looking for
‘interesting’ patterns
• e.g. ‘GROUP BY region’ can execute faster if rows arrive
ordered by ‘region’
2. Estimate costs of sub-queries on data sources
• Use source statistics and constraints
3. Traverse tree bottom-up to calculate costs for each
node
• Choose execution with minimum cost
• Remember ‘interesting’ patterns (overall cost vs. node cost)
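A toy illustration of why 'interesting' patterns are remembered. The costs are invented: an ordered scan is more expensive at its own node, but it lets the GROUP BY above it run as a cheap streaming aggregation, so the overall plan is cheaper.

```python
# Alternative ways to execute the sub-query on the source (hypothetical costs).
SCAN_ALTERNATIVES = [
    {"name": "scan", "cost": 100, "ordered_by": None},
    {"name": "scan ordered by region", "cost": 130, "ordered_by": "region"},
]


def group_by_cost(input_order, key):
    # Rows arriving ordered by the grouping key can be aggregated as they
    # stream in; otherwise a more expensive aggregation is needed.
    return 10 if input_order == key else 60


def greedy_plan(key):
    """Pick the cheapest child first, ignoring its ordering (node cost only)."""
    child = min(SCAN_ALTERNATIVES, key=lambda alt: alt["cost"])
    return child["cost"] + group_by_cost(child["ordered_by"], key), child["name"]


def best_plan(key):
    """Bottom-up: keep the cheapest child for each ordering, then pick the
    combination with the minimum overall cost."""
    best_per_order = {}
    for alt in SCAN_ALTERNATIVES:
        current = best_per_order.get(alt["ordered_by"])
        if current is None or alt["cost"] < current["cost"]:
            best_per_order[alt["ordered_by"]] = alt
    candidates = [
        (alt["cost"] + group_by_cost(order, key), alt["name"])
        for order, alt in best_per_order.items()
    ]
    return min(candidates)


greedy = greedy_plan("region")
best = best_plan("region")
```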
Other Optimization Techniques
■ Asynchronous Multi-threaded Processing
■ Execute multiple queries in parallel
■ Server Throttling Mechanisms
■ Controls to limit concurrency
■ Waiting queues for inbound connections
■ Connection pools for data sources
■ Swapping data to disk to handle large datasets
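Parallel execution with a concurrency limit can be sketched with a thread pool. The class name and the limit of two concurrent sub-queries are hypothetical; the sleep stands in for waiting on a data source.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledExecutor:
    """Runs sub-queries in parallel, never more than max_concurrent at once.
    Extra sub-queries wait in the pool's internal queue."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.peak_active = 0          # highest concurrency actually observed
        self._active = 0
        self._lock = threading.Lock()

    def _run_subquery(self, name):
        with self._lock:
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
        time.sleep(0.05)              # stand-in for the source doing the work
        with self._lock:
            self._active -= 1
        return f"{name}: done"

    def run_all(self, names):
        with ThreadPoolExecutor(max_workers=self.max_concurrent) as pool:
            # map() returns results in submission order.
            return list(pool.map(self._run_subquery, names))


executor = ThrottledExecutor(max_concurrent=2)
results = executor.run_all(["DS1", "DS2", "DS3", "DS4"])
```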
Caching
■ Caching – for slow sources and to protect
operational data sources
■ Caching enabled at view level
■ Enables mixed mode query plans
■ Caching modes
■ Full – all data in cache
■ Partial – query-by-query
■ Manual refresh or automated refresh
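A minimal sketch of the two caching modes, under the assumption that a view is queried by a region filter. The class names and the in-memory SOURCE list are invented for illustration.

```python
SOURCE = [("NA", 10), ("EMEA", 5), ("NA", 20)]  # stand-in for a slow source


def fetch_all():
    return list(SOURCE)


def fetch_region(region):
    return [row for row in SOURCE if row[0] == region]


class FullCache:
    """Full mode: all data of the view is loaded into the cache."""

    def __init__(self):
        self.source_hits = 0
        self.refresh()

    def refresh(self):
        # Called manually or by a scheduler (automated refresh).
        self.source_hits += 1
        self.rows = fetch_all()

    def query(self, region):
        return [row for row in self.rows if row[0] == region]


class PartialCache:
    """Partial mode: results are cached query by query."""

    def __init__(self):
        self.source_hits = 0
        self.entries = {}

    def refresh(self):
        self.entries.clear()

    def query(self, region):
        if region not in self.entries:
            self.source_hits += 1          # cache miss: go to the source
            self.entries[region] = fetch_region(region)
        return self.entries[region]


full = FullCache()
partial = PartialCache()
first = partial.query("NA")
second = partial.query("NA")   # served from the cache
```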
Q&A
Data Virtualization – Next Steps
Move forward at your own pace
■ Download Denodo Express –
The fastest way to Data Virtualization
■ Denodo Community: Documents, Videos, Tutorials, more.
■ Attend Architect-to-Architect Series
■ Performance
■ Scalability
■ Data Discovery and Governance
■ Security
Move forward with one of our Data
Virtualization experts
■ Phone: (+1) 877-556-2531 (NA)
■ Phone: (+44) (0)20 7869 8053 (EMEA)
■ Email: info@denodo.com | www.denodo.com
Thank You!
Next Session
Session 3
Denodo Platform: Scalability

Denodo Data Virtualization Platform Architecture: Performance (session 2 from Architect to Architect webinar series)
