Data Virtualization
Adonis Harrouk
Keyrus: Data Engineer Big Data & Cloud
Vincent Fages-Gouyou
Denodo: Director of Product Management EMEA
8 October 2020
Challenges, Use Cases & Benefits
Keyrus Introduction
Expertise & Partners
Keyrus
International player in digital/data technologies and performance
management consultancy
INSIGHT INTO VALUE
▪ Helping enterprises in data, digital and management since 1996
▪ Innovation & research are at the heart of the enterprise
▪ Worldwide presence in 20 countries on 4 continents
MANAGEMENT & TRANSFORMATION
Developing agility and accelerating the use of digital
DATA INTELLIGENCE
Master and leverage data to deliver information and enhance overall
performance
DIGITAL EXPERIENCE
Develop the digital experience in an ever-growing digital society
Accessing data from various sources
Data Sources
Structured data
Unstructured data
Ingestion
Data Access Data Insight
ETL/Batch/Streaming…
Data Warehouse Data Lake
Data Visualization
Reporting
Data Science
?
Denodo Data Virtualization
Denodo
The Leader in Data Virtualization
DENODO OFFICES, CUSTOMERS, PARTNERS
Palo Alto, CA.
Global presence throughout North America,
EMEA, APAC, and Latin America.
LEADERSHIP
▪ Longest continuous focus on data
virtualization – since 1999
▪ Leader in 2018 Forrester Wave – Big
Data Fabric
▪ Winner of numerous awards
CUSTOMERS
~850 customers, including many F500 and
G2000 companies across every major industry,
have gained significant business agility and ROI.
FINANCIALS
Backed by $4B+ private equity firm (HGGC).
50+% annual growth; profitable.
Rising complexity of Data?
Rising complexity of data
Rising complexity of data
▪ Eclectic mix of old and new data; every structure imaginable
▪ Generated and integrated, from batch to real time
▪ Traditional data from enterprise apps, web, third-parties
▪ New sources of data from machines, social media, IoT
Rising complexity of data management solutions
▪ Mix of home grown, vendor built, open source
▪ Multi-platform architectures; distributed and heterogeneous; on
premises or cloud; from relational to Hadoop
▪ Hybrid and diverse in the extreme.
Ready Access to Critical Information to Support Business Processes
The Business Need
Marketing, Sales, Executive, Support
Customers
Invoices Products
Service
Usage
Access to complete information: business
entities and pre-integrated views
Access to related information: discovery
and self service
Access in real-time from different apps and
devices
Data Is Siloed Across Disparate Systems
The Challenge
Marketing, Sales, Executive, Support
Database, Apps, Warehouse, Cloud, Big Data, Documents, Apps, NoSQL
Manually access different systems
Not productive – slows down response times
IT responds with point-to-point data integration and replication
Analytics Value Escalator
Traditional BI
Advanced Analytics
The Analytics Chasm
Analytics Needs Data
Input data for a data science project may come in a variety of systems
and formats:
• Files (CSV, logs, Parquet)
• Relational databases (EDW, operational systems)
• NoSQL systems (key-value pairs, document stores, time series, etc.)
• SaaS APIs (Salesforce, Marketo, ServiceNow, Facebook, Twitter, etc.)
Data is spread across many locations (on premises, in the cloud, SaaS, IoT, etc.)
In addition, the Big Data community has embraced data science as
one of its pillars, for example with Spark and SparkML, and with architectural
patterns like the Data Lake
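To make the variety of inputs concrete, here is a minimal sketch of what a data science project does by hand when no integration layer exists: it reads a file, a relational table and a JSON payload, then stitches them together in application code. All names and sample values (CSV_ORDERS, JSON_PROFILES, build_customer_view, the customer table) are hypothetical and only illustrate the pattern.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for two of the source types listed above.
CSV_ORDERS = "customer_id,amount\n1,20.0\n2,35.5\n1,4.5\n"  # a file export
JSON_PROFILES = (  # a SaaS API or document-store response
    '[{"customer_id": 1, "segment": "retail"},'
    ' {"customer_id": 2, "segment": "pro"}]'
)


def load_customers_from_db():
    """Read customer names from a relational source (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Alan")]
    )
    rows = conn.execute("SELECT customer_id, name FROM customer").fetchall()
    conn.close()
    return dict(rows)


def build_customer_view():
    """Combine the three sources into one list of records, keyed by customer."""
    names = load_customers_from_db()
    segments = {p["customer_id"]: p["segment"] for p in json.loads(JSON_PROFILES)}
    totals = {}
    for row in csv.DictReader(io.StringIO(CSV_ORDERS)):
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    return [
        {
            "customer_id": cid,
            "name": names[cid],
            "segment": segments[cid],
            "total": totals.get(cid, 0.0),
        }
        for cid in sorted(names)
    ]


view = build_customer_view()
```

Every new source adds another loader and another join in code like this, which is the hand-written integration burden the rest of the deck sets out to reduce.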
What is Data Virtualization?
Gartner – The Evolution of Analytical Environments
This is a Second Major Cycle of Analytical Consolidation
[Figure: timeline of analytical environments, showing operational applications, cubes, marts, ODS, staging/ingest, data warehouses, data lakes, IoT data and other new data]
1980s, Pre EDW: fragmented/nonexistent analysis
› Multiple sources
› Multiple structured sources
1990s, EDW: unified analysis
› Consolidated data
› "Collect the data"
› Single server, multiple nodes
› More analysis than any one server can provide
2000s, Post EDW: fragmented analysis
› "Collect the data" (into different repositories)
› New data types, processing, requirements
› Uncoordinated views
2010s, LDW: unified analysis
› Logically consolidated view of all data
› "Connect and collect"
› Multiple servers, of multiple nodes
› More analysis than any one system can provide
Source: “Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018. ©2018 Gartner, Inc. ID: 342254
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018
DATA VIRTUALIZATION
Agile Data Hub Architecture
Visualization, ML / AI, Data Science, Data Quality
Data Sources
Data Warehouse
NoSQL
RDBMS
Agile Data Hub Architecture
Visualization, ML / AI, Data Science, Data Quality
Governance, Metadata Management, Data Mart
Security
Data Access
Data Virtualization, Data Services
Data Sources
Data Warehouse
NoSQL
RDBMS
Agile Data Hub Architecture
Consumers
Visualization, ML / AI, Data Science, Data Quality
Governance, Metadata Management, Data Mart
Security
Data Access
Data Virtualization, Data Services
▪ Federation
▪ Transformation
▪ Abstraction
▪ Data Service
▪ Dynamic Query Optimization: Cost Based Optimizer, Query Rewriting, Caching, MPP
▪ Security & Governance
▪ Lifecycle Management
▪ Data Catalog: Discover, Collaborate, Query, Categorize
Data Sources
Data Warehouse
NoSQL
RDBMS
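As a rough illustration of federation and abstraction, the sketch below exposes two separate databases through one function, so the consumer never sees where each table lives. It uses SQLite's ATTACH DATABASE as a stand-in for a virtualization layer; this is an assumption made for illustration, not how Denodo is implemented. The database names (crm, billing), tables and the function customer_invoices_view are hypothetical.

```python
import sqlite3

# One connection plays the role of the virtualization layer; each attached
# in-memory database stands in for an independent source system.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")
conn.executescript("""
CREATE TABLE crm.customer (customer_id INTEGER, name TEXT);
CREATE TABLE billing.invoice (customer_id INTEGER, amount REAL);
INSERT INTO crm.customer VALUES (1, 'Ada'), (2, 'Alan');
INSERT INTO billing.invoice VALUES (1, 100.0), (1, 50.0), (2, 20.0);
""")


def customer_invoices_view(connection):
    """Hypothetical 'virtual view': joins both sources at query time, no copy."""
    return connection.execute("""
        SELECT c.name, SUM(i.amount) AS total_invoiced
        FROM crm.customer AS c
        JOIN billing.invoice AS i ON i.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.name
    """).fetchall()


result = customer_invoices_view(conn)
```

The consumer calls one function and gets integrated rows; swapping a source would change only the body of the view, which is the decoupling the abstraction layer is meant to provide.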
Denodo Performance
Denodo’s Massive Parallel Processing
Query:
SELECT
    c.c_first_name, c.c_last_name,
    SUM(ss.ss_quantity),
    AVG(ss.ss_sales_price)
FROM
    (SELECT * FROM current_store_sales UNION ALL
     SELECT * FROM historic_store_sales) ss
    JOIN sqls_customer c
    ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
Data volumes:
§ Customer: 3 million rows
§ Current sales: 290 million rows
§ Historic sales: 300 million rows
[Figure: execution plan. Current Sales (290 M rows) is grouped by customer ID into 3M rows (sales by customer); Hist. Sales (300 M rows); Customer (3 M rows); join; group by name]
1. Partial aggregation push-down
Maximizes source processing
Dramatically reduces network traffic
2. Integrated with Cost Based Optimizer
Based on data volume estimation and the cost of these particular operations,
the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer
Denodo automatically generates and uploads Parquet files in parallel
4. Integration with local and pre-cached data
The engine detects when data is cached or comes from a local table already in the MPP
5. Fast parallel execution
Support for Spark, Presto and Impala for fast analytical processing in
inexpensive Hadoop-based solutions
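Partial aggregation push-down can be checked on a toy scale. The sketch below, using SQLite and a handful of made-up rows, runs the slide's query as written and then a hand-rewritten version in which each sales table is first aggregated per customer key, so that far fewer rows reach the join. The rewritten SQL is an illustrative assumption, not the plan Denodo actually generates. Note that AVG cannot be pushed down directly: each partial result has to carry a sum and a count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE current_store_sales  (ss_customer_sk INTEGER, ss_quantity INTEGER, ss_sales_price REAL);
CREATE TABLE historic_store_sales (ss_customer_sk INTEGER, ss_quantity INTEGER, ss_sales_price REAL);
CREATE TABLE sqls_customer (c_customer_sk INTEGER, c_first_name TEXT, c_last_name TEXT);
INSERT INTO current_store_sales  VALUES (1, 2, 10.0), (1, 1, 20.0), (2, 5, 4.0);
INSERT INTO historic_store_sales VALUES (1, 3, 30.0), (2, 1, 8.0), (2, 2, 6.0);
INSERT INTO sqls_customer VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
""")

# The query from the slide (ORDER BY added only to make the output stable).
NAIVE = """
SELECT c.c_first_name, c.c_last_name, SUM(ss.ss_quantity), AVG(ss.ss_sales_price)
FROM (SELECT * FROM current_store_sales UNION ALL
      SELECT * FROM historic_store_sales) ss
JOIN sqls_customer c ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
ORDER BY c.c_last_name
"""

# Hand-written rewrite: each source is reduced to one row per customer key
# before the union and join; the final step combines the partial results.
PUSHED_DOWN = """
SELECT c.c_first_name, c.c_last_name, SUM(p.qty), SUM(p.price_sum) / SUM(p.n)
FROM (SELECT ss_customer_sk, SUM(ss_quantity) AS qty,
             SUM(ss_sales_price) AS price_sum, COUNT(ss_sales_price) AS n
      FROM current_store_sales GROUP BY ss_customer_sk
      UNION ALL
      SELECT ss_customer_sk, SUM(ss_quantity) AS qty,
             SUM(ss_sales_price) AS price_sum, COUNT(ss_sales_price) AS n
      FROM historic_store_sales GROUP BY ss_customer_sk) p
JOIN sqls_customer c ON p.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
ORDER BY c.c_last_name
"""

naive_rows = conn.execute(NAIVE).fetchall()
pushed_rows = conn.execute(PUSHED_DOWN).fetchall()
```

Both queries return the same rows here; the difference is that the second one only needs one row per customer from each sales source, which is where the reduction in network traffic comes from.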
Denodo’s Massive Parallel Processing
System                     Execution Time   #Rows through network   Optimization Techniques
Other federation systems   ~ 10 min         593M                    Simple federation
Hadoop/MPP systems         ~ 4 min          293M                    MPP Only
Denodo (No MPP)            43 sec           6M                      Aggregation push-down
Denodo (With MPP)          11 sec           6M                      Aggregation push-down + MPP integration (8 nodes)
Query:
SELECT
    c.c_first_name, c.c_last_name,
    SUM(ss.ss_quantity),
    AVG(ss.ss_sales_price)
FROM
    (SELECT * FROM current_store_sales UNION ALL
     SELECT * FROM historic_store_sales) ss
    JOIN sqls_customer c
    ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
[Chart: execution time in seconds (smaller is better), comparing the same queries with Denodo (With MPP), Denodo (No MPP), Hadoop/MPP systems and other federation systems]
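The figures in the comparison can be sanity-checked with a few lines of arithmetic. The sketch below copies the row counts and execution times from the slides (the "~" values are approximate in the source) and computes two things: the number of rows a simple federation has to move, and each system's execution time relative to Denodo with MPP. The function names are hypothetical.

```python
# Figures copied from the slides; "~ 10 min" and "~ 4 min" are approximate.
ROWS_MILLIONS = {"customer": 3, "current_sales": 290, "historic_sales": 300}
EXECUTION_SECONDS = {
    "Other federation systems": 10 * 60,
    "Hadoop/MPP systems": 4 * 60,
    "Denodo (No MPP)": 43,
    "Denodo (With MPP)": 11,
}


def rows_moved_by_simple_federation(rows):
    """Simple federation pulls every row of every table over the network."""
    return sum(rows.values())


def slowdown_relative_to(baseline, times):
    """How many times longer each system takes than the chosen baseline."""
    return {name: round(t / times[baseline], 1) for name, t in times.items()}


total_rows = rows_moved_by_simple_federation(ROWS_MILLIONS)
ratios = slowdown_relative_to("Denodo (With MPP)", EXECUTION_SECONDS)
```

The sum of the three tables matches the 593M reported for simple federation, and the ratios put the approximate timings at roughly 55 times and 22 times the Denodo-with-MPP figure.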
Smart Caching for Analytics
Denodo 8 will enable the persistence of summaries to accelerate the
execution of analytical queries
§ Common joins, aggregations and filters can be precomputed (in the cache or in a
data source) and used as starting points to accelerate queries
§ Key for LDW self-service initiatives where user-driven exploration is a must
Similar to the concept of aggregation awareness used by reporting
tools (BO, MSTR) and OLAP engines
§ Integrated with the query rewriting rules and CBO of Denodo’s engine to provide
features not available from any other vendor for LDW scenarios
§ Denodo can provide this acceleration technology for all data sources and all
consumers
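The idea behind summaries can be sketched with a precomputed aggregate table. In the example below a summary is stored once at (store, day) granularity, and a coarser per-store query is then answered from the summary instead of the base table. The rewrite is done by hand here; the automatic rewriting described above is not shown. The table and column names (sales, sales_summary, amount_sum, row_count) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (store TEXT, day TEXT, amount REAL);
INSERT INTO sales VALUES
  ('paris', '2020-10-01', 10.0), ('paris', '2020-10-01', 5.0),
  ('paris', '2020-10-02', 7.0),  ('lyon',  '2020-10-01', 3.0);
-- Hypothetical summary, precomputed once at (store, day) granularity.
CREATE TABLE sales_summary AS
  SELECT store, day, SUM(amount) AS amount_sum, COUNT(*) AS row_count
  FROM sales GROUP BY store, day;
""")

# The query a user writes, against the base table.
from_base = conn.execute(
    "SELECT store, SUM(amount), COUNT(*) FROM sales GROUP BY store ORDER BY store"
).fetchall()

# The same question answered from the summary: sums of sums, sums of counts.
from_summary = conn.execute(
    "SELECT store, SUM(amount_sum), SUM(row_count) "
    "FROM sales_summary GROUP BY store ORDER BY store"
).fetchall()
```

Sums and counts roll up cleanly from a finer summary to a coarser query, which is why one stored summary can serve many different aggregations.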
Denodo Platform
One Platform - Multiple Personas
▪ Denodo Developer
▪ Administration & Operations
▪ Business User & BI Analyst
▪ Data Scientist
▪ Application-to-Application
8.0 Technical Architecture
CONSUMERS
▪ SQL
▪ { API ACCESS }: RESTful / OData, GraphQL / GeoJSON
▪ DATA CATALOG: Discover - Explore - Document
LOGICAL DATA FABRIC
▪ DATA VIRTUALIZATION: DEVELOPMENT, MODELING, DELIVERY
▪ Query Optimization, Security, AI/ML, Governance, Semantic Layer, Real Time Acceleration, Caching
▪ CONNECTIVITY: 150+ data adapters
SOURCES
▪ Traditional DB & DW, Cloud Stores, Hadoop & NoSQL, OLAP, Files, Apps, Streaming, SaaS
DATA OPS
▪ Deployment: Cloud PaaS, Containers/K8, On-Prem
▪ Monitoring
▪ Scheduling
▪ Version Control
“Denodo provides its customers with the
necessary product capabilities for
automating the data fabric design with
its core platform components – a unified
semantic catalog, a dynamic query
optimization engine and runtime
metadata-based ML algorithms. Its data
fabric design relies on data virtualization
to provide integrated data quickly to
business users to effect faster outcomes.”
2020 Gartner Magic Quadrant for Data Integration Tools
Denodo is Named a Leader
Summary
The combination of a business delivery layer with an abstraction layer in a single
platform can efficiently address three business challenges:
1. Faster & more accurate decision making
▪ Self-service with proper guardrails
▪ Data models and catalog are one and the same
2. IT cost reduction
▪ Decouple IT from business, giving them freedom to
choose the right technology for the right problem
3. Regulations, enterprise-wide governance &
data security
▪ Controlled access to all data assets in a secure,
business-friendly format
▪ Full audit trails
Keyrus Conclusion
Data Virtualization is a game changer
Data Virtualization use cases
From Data Storage & Management to Data Consumers, via Data Governance & Security
Layers: Data Consumption; Data Governance, Manipulation & Access; Data Storage & Management
Consumers: Sales, HR, Executive, Marketing, Apps/API, Data Science, AI/ML
▪ Decision (Real time)
▪ Single View (Customer 360)
▪ Agile BI (Self-service)
▪ Data Science (ML & AI)
▪ Apps (Mobile & web)
▪ Mergers & Acquisitions
▪ Data Marketplace
▪ Compliance (IFRS17, GRC)
▪ Data Security
▪ APIfication (& SQLification)
▪ Unified Data Layer
▪ Agility & Simplicity
▪ Real-time Delivery
▪ Data Abstraction
▪ Zero Replication
▪ Data Governance
▪ Sophisticated Optimizations
▪ Logical Data Warehouse/Lake
▪ Big Data Fabric
▪ Hybrid Data Fabric
▪ Data Integration
▪ Data Migration
▪ Refactoring & Replatforming
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm,
without the prior written authorization from Denodo Technologies.
