DATA VIRTUALIZATION PACKED LUNCH
WEBINAR SERIES
Sessions Covering Key Data Integration Challenges
Solved with Data Virtualization
Data Virtualization: An Introduction
Michael Dickson
Sales Engineer, Denodo
Paul Moxon
VP Data Architectures & Chief Evangelist, Denodo
Agenda
1. Data Virtualization: An Introduction
2. Data Virtualization Platforms – Key Capabilities
3. Product Demo
4. Key Takeaways
5. Q&A
6. Next Steps
Data Virtualization: An Introduction
Data Integration – “The Way We Were…”
Operational Data Stores → ETL → Staging Area → ETL → Data Warehouse → ETL → Data Marts → Analytics and Reporting
Data Integration – A Modern Data Ecosystem
The Data Integration Challenge
Business users manually access different systems; IT responds with point-to-point data integration; it takes too long to get answers to the business.
Consumers: Marketing, Sales, Executive, Support
Sources: Database, Apps, Warehouse, Cloud, Big Data, Documents, NoSQL
“Data bottlenecks create business bottlenecks.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
The Data Integration Challenge
It is difficult to integrate numerous
on-premises and cloud data sources.
Traditional tools cannot integrate streaming
data and data-at-rest in real time.
It is difficult to maintain consistent data
access and governance policies across data
silos.
Traditional data integration is extremely
resource intensive.
The Solution – A Data Abstraction Layer
Abstracts access to
disparate data sources
Acts as a single repository
(virtual)
Makes data available in
real-time to consumers
DATA ABSTRACTION LAYER
“Enterprise architects must revise their data
architecture to meet the demand for fast data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
Data Virtualization
“Data virtualization integrates disparate data sources in real time or near-real time
to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
1. Connects to disparate data sources
2. Combines related data into views
3. Publishes the data to applications
Data Virtualization Reference Architecture
Source: “Gartner Market Guide for data virtualization – 2016”
Data virtualization technology can be used to create
virtualized and integrated views of data in memory (rather
than executing data movement and physically storing
integrated views in a target data structure), and provides a
layer of abstraction above the physical implementation of
data.
What Data Virtualization is Not!
• It is not ETL
• If you want to replicate data from ‘A’ to ‘B’…use an ETL tool – it’s what they are designed for
• It is not Data Visualization (note the ‘s’)
• It complements visualization and reporting tools (e.g. Tableau)
• It is not a database
• Data Virtualization Platforms don’t store the data…it’s retrieved from the data sources on
demand
• It has many capabilities such as governance, metadata management, security, etc.
• It will work with specialized tools in these areas
• It’s great for service-based architectures
• But be wary of event-driven architectures…use an ESB (or similar) for this
By 2018, organizations with data virtualization capabilities
will spend 40% less on building and managing data
integration processes for connecting distributed data assets.
− Gartner, Predicts 2017: Data Distribution and Complexity Drive Information Infrastructure
Modernization, Ted Friedman et al.
Data Virtualization Platforms –
Key Capabilities
Five Essential Capabilities of Data Virtualization
1. Data abstraction
2. Zero replication, zero relocation
3. Real-time information
4. Self-service data services
5. Centralized metadata, security & governance
1. Data abstraction
Abstracts access to disparate data
sources.
Acts as a single virtual repository.
Abstracts data complexities like
location, format, protocols
…hides data complexity for ease of data access by the business
“Enterprise architects must revise their data
architecture to meet the demand for fast data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research
2. Zero replication, zero relocation
…reduces development time and overall TCO
“The Denodo Platform enables us to build and
deliver data services, to our internal and external
consumers, within a day instead of the 1–2
weeks it would take with ETL.”
– Manager, DrillingInfo
Leaves the data at its source; extracts
only what is needed, on demand.
Diminishes the need for effort-intensive
ETL processes.
Eliminates unnecessary data
redundancy.
3. Real-time information
Provisions data in real-time to consumers
Creates real-time logical views of data
across many data sources.
Supports transformations and quality
functions without the latency,
redundancy, and rigidity of legacy
approaches
…enables timely decision-making
“Data virtualization integrates disparate data sources in real
time or near-real time to meet demands for analytics and
transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
4. Self-service data services
Facilitates access to all data, both internal and
external
Enables creation of universal semantic models
reflecting business taxonomy
Connects data silos to provide best available
information to drive business decisions
…enables information discovery and self-service
“Impressively quick turnaround time to ‘unlock’ data from
additional silos and from legacy systems. Few vendors (if
any) can compete with Denodo's support of the RESTful/OData
standard – both to provide data (northbound) and
to access data from the sources (southbound).”
– Business Analyst, Swiss Re
5. Centralized metadata, security & governance
Abstracts data source security models and enables
single-point security and governance.
Extends single-point control across cloud and on-premises architectures.
Provides multiple forms of metadata (technical,
business, operational) to facilitate understanding of
data.
…simplifies data security, privacy, audit
“Our Denodo rollout was one of the easiest and most successful
rollouts of critical enterprise software I have seen. It was
successful in handling our initial security use case
immediately, and has since shown a strong ability to cover
additional use cases, in particular acting as a Data Abstraction
Layer via its web service functionality.”
– Enterprise Architect, Asurion
Denodo ‘Solution’ Categories
Customer Centricity / MDM
✓ Complete View of Customer
Data Services
✓ Data as a Service
✓ Data Marketplace
✓ Data Services
✓ Application and Data Migration
Cloud Solutions
✓ Cloud Modernization
✓ Cloud Analytics
✓ Hybrid Data Fabric
Data Governance
✓ GRC
✓ GDPR
✓ Data Privacy / Masking
BI and Analytics
✓ Self-Service Analytics
✓ Logical Data Warehouse
✓ Enterprise Data Fabric
Big Data
✓ Logical Data Lake
✓ Data Warehouse Offloading
✓ IoT Analytics
Product Demonstration
Data Virtualization – An Introduction
Michael Dickson
Sales Engineer, Denodo
Demo Architecture
What’s the impact of a new
marketing campaign for each
country?
▪ Historical sales data offloaded to
Hadoop cluster for cheaper storage
▪ Marketing campaigns managed in an
external cloud app
▪ Country is part of the customer
details table, stored in the DW
Layers: Sources → Combine, Transform & Integrate → Consume
Base views (source abstraction): Sales, Campaign, Customer
Integration: join → group by state → join
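The integrated view the demo builds can be sketched in miniature. The schema and sample rows below are invented for illustration, and a single SQLite database stands in for the three real sources (historical sales in Hadoop, campaigns in a cloud app, customer details in the DW) that the virtualization layer federates:

```python
import sqlite3

# Three "sources" modeled as tables in one in-memory database.
# Table and column names are assumptions based on the demo slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer_id INT, campaign_id INT, amount REAL);
CREATE TABLE campaign (id INT, name TEXT);
CREATE TABLE customer (id INT, country TEXT);
INSERT INTO sales VALUES (1,100,50),(2,100,70),(3,101,20);
INSERT INTO campaign VALUES (100,'Spring Promo'),(101,'Autumn Promo');
INSERT INTO customer VALUES (1,'US'),(2,'US'),(3,'DE');
""")

# The combined view: campaign impact per country.
rows = conn.execute("""
    SELECT cu.country, ca.name, SUM(s.amount)
    FROM sales s
    JOIN campaign ca ON ca.id = s.campaign_id
    JOIN customer cu ON cu.id = s.customer_id
    GROUP BY cu.country, ca.name
    ORDER BY cu.country
""").fetchall()
print(rows)  # [('DE', 'Autumn Promo', 20.0), ('US', 'Spring Promo', 120.0)]
```

In the platform itself the same logical view is defined once over the three base views, and the engine decides at query time how much of the work each source executes.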
Demo
What is the optimizer doing?

    SELECT c.state, AVG(s.amount)
    FROM customer c JOIN sales s
      ON c.id = s.customer_id
    GROUP BY c.state

Three candidate strategies:
• Naïve strategy: transfer both tables (Sales: 300 M rows; Customer: 2 M rows) and perform the join and group by in the virtualization layer.
• Temporary data movement: create a temp table (Temp_Customer, 2 M rows) alongside Sales so the join and group by run at the source.
• Partial aggregation pushdown: push an aggregation by customer ID down to the source, transfer only the 2 M pre-aggregated rows plus the 2 M customer rows, then finish with a group by state (50 rows).

The pushed-down query:

    SELECT c.id, amount
    FROM (SELECT s.customer_id,
                 SUM(amount) amount
          FROM sales s
          GROUP BY s.customer_id) s_agg
    JOIN customer c
      ON (c.id = s_agg.customer_id)
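The equivalence behind partial aggregation pushdown can be checked with a small sketch (SQLite stands in for the sources, and the data is made up). One subtlety the slide glosses over: to finish an AVG after pushdown, the partial aggregate must carry both SUM and COUNT, not just SUM:

```python
import sqlite3

# Toy tables following the slide's names; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE sales (customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1,'CA'),(2,'CA'),(3,'NY');
INSERT INTO sales VALUES (1,10),(1,20),(2,30),(3,40),(3,50),(3,60);
""")

# Naive strategy: join the full tables, then aggregate.
naive = conn.execute("""
    SELECT c.state, AVG(s.amount)
    FROM customer c JOIN sales s ON c.id = s.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

# Partial aggregation pushdown: aggregate per customer "at the source"
# (SUM and COUNT so the AVG can be finished later), join the much
# smaller partial result, then finish the aggregation by state.
pushed = conn.execute("""
    SELECT c.state, SUM(s_agg.amt) * 1.0 / SUM(s_agg.cnt)
    FROM (SELECT customer_id, SUM(amount) amt, COUNT(*) cnt
          FROM sales GROUP BY customer_id) s_agg
    JOIN customer c ON c.id = s_agg.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

print(naive)   # [('CA', 20.0), ('NY', 50.0)]
print(pushed)  # identical, but only one pre-aggregated row per customer crossed the network
```

At slide scale the rewrite is what turns a 300 M-row transfer into a 2 M-row one.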
Why is this so important?

    SELECT c.state, AVG(s.amount)
    FROM customer c JOIN sales s
      ON c.id = s.customer_id
    GROUP BY c.state

How Denodo works compared with other federation engines:

System  | Execution Time | Data Transferred | Optimization Technique
Denodo  | 9 sec          | 4 M rows         | Aggregation push-down
Others  | 125 sec        | 302 M rows       | None: full scan

To maximize pushdown to the EDW, the aggregation is split into two steps: first by customer ID (reducing the 300 M Sales rows to 2 M at the source), then by state (in Denodo). This significantly reduces network traffic and processing in Denodo.
Massive Parallel Processing: Example – Before MPP Integration

    SELECT c.name, AVG(s.amount)
    FROM customer c JOIN sales s
      ON c.id = s.customer_id
    GROUP BY c.name

Similar to the previous query, but now aggregating by customer name. What changes?
• Partial aggregation pushdown: still maximizes source processing and reduces network traffic – Sales (300 million rows) is grouped by customer ID at the source, producing 2 M rows (sales by customer) to join with Customer (2 M rows) before the final group by name.
• Swapping to disk: the aggregation by customer name produces a larger result set (2 M rows) that exceeds the memory quota, so Denodo swaps to disk to perform the intermediate calculation.
• Serial calculation: Denodo performs the final aggregation in serial, one row after another. With the larger volume, this now becomes the execution bottleneck.
Massive Parallel Processing: Example – With MPP Integration

System   | Execution Time | Optimization Techniques
Others   | ~10 min        | Basic
No MPP   | 43 sec         | Aggregation push-down
With MPP | 11 sec         | Aggregation push-down + MPP integration (Impala, 8 nodes)

The same plan – Sales (300 million rows) grouped by customer ID into 2 M rows (sales by customer), joined with Customer (2 M rows), then grouped by ZIP – now benefits from:
1. Partial aggregation pushdown: maximizes source processing and reduces network traffic.
2. Integration with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer: for SQL-on-Hadoop systems, Denodo automatically generates and uploads Parquet files.
4. Integration with local and pre-cached data: the engine detects when data is cached or is a native table in the MPP.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based solutions.
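The split the MPP exploits – independent per-partition partial aggregations whose results merge into the final answer – can be illustrated in a few lines. Threads stand in for the Impala/Spark/Presto workers, and the data is synthetic:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Fake sales rows: (customer, amount). 5 customers, 1000 rows.
sales = [("cust%d" % (i % 5), i % 7) for i in range(1000)]
partitions = [sales[i::4] for i in range(4)]  # 4 "nodes"

def partial_agg(part):
    """Per-partition partial aggregation: amount totals by customer."""
    totals = Counter()
    for cust, amount in part:
        totals[cust] += amount
    return totals

# Each partition is aggregated independently (in parallel)...
with ThreadPoolExecutor(max_workers=4) as ex:
    partials = list(ex.map(partial_agg, partitions))

# ...and the partial results are merged into the final answer.
merged = Counter()
for p in partials:
    merged.update(p)  # Counter.update adds counts

serial = partial_agg(sales)   # the single-worker answer
assert merged == serial       # merging partials matches the serial result
```

Because SUM (like COUNT, MIN, MAX) is commutative and associative, the merge order does not matter, which is exactly why the engine can hand the partitions to however many MPP nodes are available.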
Key Takeaways
1. Data Virtualization is a key technology when building a modern data architecture.
2. It provides flexibility and agility, and reduces the time to deliver data to the business by up to 10X.
3. Data Virtualization hides the complexity of a constantly changing data infrastructure from the users.
4. In doing so, it allows you to introduce new technologies, formats, protocols, etc. without causing user disruption.
5. Beware! Not all Data Virtualization platforms are equal…compare them against the ‘5 criteria’.
Q&A
Next steps
Download Denodo Express:
www.denodoexpress.com
Access Denodo Platform in the Cloud!
30 day FREE trial available!
Denodo for Azure:
www.denodo.com/TrialAzure/PackedLunch
Denodo for AWS: www.denodo.com/TrialAWS/PackedLunch
Next session
From Single Purpose to Multi Purpose
Data Lakes - Broadening End Users
Thursday, August 16, 2018
Paul Moxon
VP Data Architectures & Chief Evangelist, Denodo
Thank you!
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and
microfilm, without the prior written authorization from Denodo Technologies.
