Data Virtualisation for Data Architects
October 2020
Enabling digital transformation by connecting expression, experience and enablement
▪ 600+ Consultants
▪ 100+ Active Clients
▪ ASX Listed Company
▪ 4 Australian Locations
▪ RXP Data Management Framework
End-to-end Information Management framework for the entire data lifecycle.
Data Virtualisation is relevant in these areas
Denodo Data Virtualization
Gartner – The Rise of Logical Architectures
This is a Second Major Cycle of Analytical Consolidation
[Figure: timeline of analytical architectures. Sources grow from operational applications and cubes to IoT data and other new data; the Logical Data Warehouse spans Data Warehouse, Data Lake, Marts, ODS and Staging/Ingest]
1980s: Pre EDW
Fragmented/nonexistent analysis
› Multiple sources
› Multiple structured sources
1990s: EDW
Unified analysis
› Consolidated data
› "Collect the data"
› Single server, multiple nodes
› More analysis than any one server can provide
2000s: Post EDW
Fragmented analysis
› "Collect the data" (into different repositories)
› New data types, processing, requirements
› Uncoordinated views
2010s: LDW
Unified analysis
› Logically consolidated view of all data
› "Connect and collect"
› Multiple servers, of multiple nodes
› More analysis than any one system can provide
©2018 Gartner, Inc. ID: 342254
Data Virtualization
√ Improved time to market by 50 to 90%
√ Improved report consistency
√ Reduced duplication of data
√ Improved transparency
√ Reduced development cost
√ Future-proofs the architecture against technology changes
Data Virtualization as a Data Access Layer
[Figure: the data virtualization layer sits between disparate data sources and data consumers]
1. Connect: disparate data sources
2. Combine: data virtualization
› Execution Engine & Optimizer
› Metadata Repository
› Relational Cache
› MPP Processing
› Corporate Security
› Monitoring & Auditing
3. Consume: data consumers
› SQL Queries (JDBC, ODBC, ADO.NET)
› Web Services (SOAP, REST, OData)
› Web-based catalog & search
› Secure delivery (SSL/TLS)
Data Virtualization in Action
[Figure: layers of views between disparate data sources (Operational, Less Structured) and data consumers]
Steps: 1. Connect, 2. Combine, 3. Consume
View layers, from the sources upward:
› Base/Raw views
› Standardized views: Customer, Product, Order
› Business views: Finance, Operations, Sales
Each layer of views provides more refined Single Views of Truth.
Delivery to data consumers:
› SQL Queries (JDBC, ODBC, ADO.NET)
› Web Services (SOAP, REST, OData)
› Web-based catalog & search
› Secure delivery (SSL/TLS)
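The layering of views can be illustrated with a small sketch. This is not Denodo syntax: it uses Python's built-in sqlite3 module and plain SQL views as a stand-in, and every table, column and view name (crm_cust, erp_orders, bv_customer, std_customer, sales_by_state and so on) is a hypothetical chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins for source tables (hypothetical names and columns).
    CREATE TABLE crm_cust (cust_id INTEGER, cust_nm TEXT, st TEXT);
    CREATE TABLE erp_orders (ord_id INTEGER, cust_id INTEGER, amt_cents INTEGER);
    INSERT INTO crm_cust VALUES (1, ' alice ', 'nsw'), (2, 'Bob', 'VIC');
    INSERT INTO erp_orders VALUES (10, 1, 2500), (11, 1, 7500), (12, 2, 4000);

    -- Layer 1: base/raw views mirror the sources one-to-one.
    CREATE VIEW bv_customer AS SELECT * FROM crm_cust;
    CREATE VIEW bv_order AS SELECT * FROM erp_orders;

    -- Layer 2: standardized views fix names, formatting and units.
    CREATE VIEW std_customer AS
        SELECT cust_id AS id, TRIM(cust_nm) AS name, UPPER(st) AS state
        FROM bv_customer;
    CREATE VIEW std_order AS
        SELECT ord_id AS id, cust_id AS customer_id, amt_cents / 100.0 AS amount
        FROM bv_order;

    -- Layer 3: a business view answers a question for one audience (Sales).
    CREATE VIEW sales_by_state AS
        SELECT c.state, SUM(o.amount) AS total_amount
        FROM std_customer c JOIN std_order o ON c.id = o.customer_id
        GROUP BY c.state;
""")

rows = conn.execute(
    "SELECT state, total_amount FROM sales_by_state ORDER BY state"
).fetchall()
print(rows)
```

Consumers query only the business view; a change in a source table then needs to be absorbed in the base or standardized layer rather than in every report.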
Platform Demonstration
Demo Scenario
How effective are our marketing campaigns?
▪ Historical sales data offloaded to a Hadoop cluster for cheaper storage
▪ Marketing campaigns managed in an external cloud app (SaaS solution)
▪ Country is part of the customer details table, stored in the DW
[Figure: the Sales, Campaign and Customer sources are abstracted as base views, then combined, transformed and integrated (join, group by state, join) for consumption]
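The demo scenario can be approximated in a few lines, as a rough sketch rather than the actual demo. Three attached in-memory SQLite databases stand in for the Hadoop cluster, the cloud app and the DW; the schema names (hadoop, saas, dw), the table layouts and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each attached in-memory database stands in for a separate physical source.
conn.execute("ATTACH DATABASE ':memory:' AS hadoop")  # historical sales
conn.execute("ATTACH DATABASE ':memory:' AS saas")    # marketing campaigns
conn.execute("ATTACH DATABASE ':memory:' AS dw")      # customer details

conn.executescript("""
    CREATE TABLE hadoop.sales (customer_id INTEGER, campaign_id INTEGER, amount REAL);
    CREATE TABLE saas.campaign (id INTEGER, name TEXT);
    CREATE TABLE dw.customer (id INTEGER, country TEXT);

    INSERT INTO dw.customer VALUES (1, 'AU'), (2, 'NZ');
    INSERT INTO saas.campaign VALUES (100, 'Spring'), (200, 'Winter');
    INSERT INTO hadoop.sales VALUES
        (1, 100, 50.0), (2, 100, 30.0), (1, 200, 20.0), (1, 100, 10.0);
""")

# One query spans all three sources, as a consumer of the virtual layer would see it.
revenue = conn.execute("""
    SELECT g.name AS campaign, c.country, SUM(s.amount) AS revenue
    FROM hadoop.sales s
    JOIN saas.campaign g ON g.id = s.campaign_id
    JOIN dw.customer c ON c.id = s.customer_id
    GROUP BY g.name, c.country
    ORDER BY g.name, c.country
""").fetchall()
print(revenue)
```

The consumer writes a single SQL statement and does not need to know where each table physically lives, which is the point the scenario is making.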
Personas
▪ Denodo Developer
▪ Business User & BI Analyst
▪ Data Scientist
▪ Application-to-Application
▪ Administration & Operations
Unified Web Administration: Central Web Portal
▪ Entry point for all users to all Denodo environments
▪ SSO to all tools with Kerberos, SAML or OAuth
Key Takeaways
Data Virtualization:
1. Enables data re-use, reducing costs & increasing collaboration
2. Unifies disparate data sources in real time
3. Supports self-service & data discovery
4. Centralises governance & security of enterprise data assets
Q&A
What is the optimizer doing?
SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.state
Customer and Sales are in different sources. What is the best execution plan?
Option 1: Naïve Strategy
› Transfer Sales (300 M rows) and Customer (2 M rows), then join and group by
› ‘Cost’ ~302 M
Option 2: Temporary Data Movement
› Create a temp table (Temp_Customer), then join and group by
› Row counts shown: 2 M and 50 M
› ‘Cost’ ~52 M
Option 3: Partial Aggregation Pushdown
› Group Sales by ID at the source (2 M rows), join with Customer (2 M rows), then group by state
› ‘Cost’ ~4 M
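The cost figures on this slide are simply the row counts each plan moves across the network. A toy comparison, assuming rows transferred is the only cost factor (a real optimizer weighs far more than this), could look like the following; the dictionary layout and plan labels are illustrative only.

```python
M = 1_000_000

# Rows moved over the network by each candidate plan, taken from the slide.
plans = {
    "naive": [300 * M, 2 * M],
    "temporary data movement": [2 * M, 50 * M],
    "partial aggregation pushdown": [2 * M, 2 * M],
}

# Cost of a plan = total rows transferred.
costs = {name: sum(rows) for name, rows in plans.items()}
best = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ~{cost // M} M rows")
print("cheapest plan:", best)
```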
Why is this so important?
SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.state
How Denodo works compared with other federation engines
System   Execution Time   Data Transferred   Optimization Technique
Denodo   9 sec.           4 M                Aggregation push-down
Others   125 sec.         302 M              None: full scan
[Figure: without optimization, 300 M Sales rows and 2 M Customer rows are transferred, then joined and grouped; with aggregation push-down, Sales is grouped by ID at the source (2 M rows), joined with Customer (2 M rows), then grouped by state]
To maximize push-down to the EDW, the aggregation is split into 2 steps:
• 1st by customerID
• 2nd by state
This significantly reduces network traffic and processing in Denodo.
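The two-step split can be checked on a toy dataset. The sketch below uses sqlite3 with made-up rows and a hypothetical partial_sales temp table; it illustrates the general technique and makes no claim about how Denodo rewrites the query internally. Because an average of averages would be wrong, step 1 carries SUM and COUNT per customer and step 2 recombines them per state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'NSW'), (2, 'NSW'), (3, 'VIC');
    INSERT INTO sales VALUES
        (1, 10.0), (1, 20.0), (1, 30.0), (2, 100.0), (3, 5.0), (3, 15.0);
""")

# Reference result: join every sales row, then aggregate.
naive = conn.execute("""
    SELECT c.state, AVG(s.amount)
    FROM customer c JOIN sales s ON c.id = s.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

# Step 1 (would run inside the sales source): one row per customer.
partial = conn.execute("""
    SELECT customer_id, SUM(amount), COUNT(amount)
    FROM sales GROUP BY customer_id
""").fetchall()

# Step 2 (after transfer): join the small partial result and group by state.
conn.execute(
    "CREATE TEMP TABLE partial_sales (customer_id INTEGER, sum_amount REAL, n INTEGER)"
)
conn.executemany("INSERT INTO partial_sales VALUES (?, ?, ?)", partial)
two_step = conn.execute("""
    SELECT c.state, SUM(p.sum_amount) / SUM(p.n)
    FROM customer c JOIN partial_sales p ON c.id = p.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

print(naive, two_step, len(partial))
```

Both plans return the same averages, but the second moves one row per customer instead of one row per sale, which is where the drop from roughly 302 M to roughly 4 M transferred rows on the slide comes from.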
Denodo Performance Strategies
• Post-processing and Federation in the DV engine
• Delegation
▪ Process as much as possible in the data sources
• Temporary Tables
▪ Automatically move data to the biggest data source to optimize the execution
• Summaries
▪ Based on the query the Denodo optimizer can use a “summary” for accelerating the execution
• MPP Integration
▪ Move processing to an external MPP system on the fly
• Caching
▪ Persist data beforehand in a relational database
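Of the strategies listed, caching is the simplest to sketch. The example below assumes a hypothetical slow source (fetch_from_source) and a hypothetical customer_cache table; it only illustrates the idea of persisting data beforehand in a relational database and is not how Denodo's cache is configured.

```python
import sqlite3

calls = 0  # counts how often the (hypothetical) slow source is hit


def fetch_from_source():
    """Stand-in for an expensive remote query."""
    global calls
    calls += 1
    return [(1, "NSW"), (2, "VIC"), (3, "NSW")]


cache = sqlite3.connect(":memory:")


def load_cache():
    """Persist the source data in a relational table ahead of query time."""
    cache.execute("CREATE TABLE IF NOT EXISTS customer_cache (id INTEGER, state TEXT)")
    cache.execute("DELETE FROM customer_cache")
    cache.executemany("INSERT INTO customer_cache VALUES (?, ?)", fetch_from_source())
    cache.commit()


def states():
    """Answer a query from the cache without touching the source."""
    cursor = cache.execute("SELECT DISTINCT state FROM customer_cache ORDER BY state")
    return [row[0] for row in cursor]


load_cache()
first = states()
second = states()
print(first, second, calls)
```

The source is read once during the load; both later queries are served from the cached table, at the price of data that is only as fresh as the last load.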
Next Steps
Join us again for the final topic in this 3-part series:
Webinar 3 will be held at 12:30 on Thursday 19th November
▪ Topic: Data Virtualisation for Business Consumption
Contact information:
Adrian Bridge
Principal Consultant
RXP Group
0417 875 919
adrian.bridge@rxpservices.com
Katrina Briedis
Sales Engineering
Denodo
+61 450 499 440
kbriedis@denodo.com
Thank you

Data Virtualization for Data Architects (Australia)
