Architecture and Performance Considerations in the Logical Data Lake
Dr. Alberto Pan, Chief Technical Officer
Agenda
1. Data Lake Architecture
2. Data Virtualization in the Logical Data Lake
3. Performance: 'Move Processing to the Data'
4. Performance: Choosing the Best Execution Plan
5. Example Scenario: The Numbers
Data Lake Architecture
Architecture of the Data Lake
[Diagram: the physical Data Lake.
- Data sources: big data sources (sensor data, machine data/logs, social data, clickstream data, internet data, image and video, unstructured enterprise content), traditional enterprise data (enterprise applications), and cloud applications.
- Ingestion: batch ETL, CDC and Sqoop, plus real-time data access, on-demand or streaming (Flume, Kafka, ...).
- Storage and processing: EDW, ODS, NoSQL stores, in-memory systems (SAP HANA, ...), analytical appliances, cloud DWs (Redshift, ...), and a Hadoop cluster (HDFS, YARN workload management, Hive, Spark, Drill, Impala, Tez, MapReduce, Storm, HBase, Solr, Hunk) covering DW, SQL, streaming, NoSQL, and search workloads.
- Consumers: real-time decision management, alerts, scorecards, dashboards, reporting, data discovery, self-service, search, predictive analytics, statistical analytics (R), text analytics, data mining.
- Cross-cutting: metadata management, data governance, data security.]
The Logical Data Lake Architecture
Integrated view of a plurality of systems: Hadoop, EDW, streaming, in-memory, ...

Questions for the logical data lake:
- How can I combine data from several systems while ensuring good performance?
- How can I abstract consuming applications from technology change and requirements evolution?
- How can I enforce consistent security and governance policies across the Data Lake?
Data Virtualization in the Logical Data Lake
Architecture of the Data Lake
[Diagram: the same physical Data Lake architecture as above, shown again before introducing the Data Virtualization layer.]
Architecture of the Logical Data Lake
[Diagram: the same Data Lake, with a Data Virtualization layer inserted between the consuming applications and the underlying systems (EDW, ODS, NoSQL, in-memory, analytical appliances, cloud DW, Hadoop). The layer provides data abstraction, data federation, data transformation, data caching, data services, data search & discovery, governance, security, and optimization, together with real-time data access (on-demand / streaming).]
What is Needed?
Requirements for the integration component in the Logical Data Lake. Denodo Data Virtualization is the only option that satisfies all of the following:
- The ability to answer ad hoc queries combining data from several systems
- Performance comparable to physical approaches
- The ability to expose different logical views over the same data
- A single entry point at which to apply security and governance policies
- Comprehensive, granular security support
Performance: Move Processing to the Data
Move Processing to the Data
- Process the data locally, where it resides
- The DV system combines the partial results
- Minimizes network traffic
- Leverages specialized data sources
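To make the pattern concrete, here is a minimal sketch (not Denodo's actual engine): the aggregation is pushed down to each source, and the virtualization layer only merges the small partial results. The `sales` table, its columns, and the DB-API-style connections are assumptions made purely for illustration.

```python
# Illustrative sketch only: push an aggregation down to each source and
# merge the partial results in the virtualization layer. "sources" are
# assumed to be DB-API-style connections (e.g. sqlite3) that each hold a
# slice of the same logical "sales" table.

def total_sales_by_product(sources):
    pushed_down = (
        "SELECT product_id, SUM(amount) AS total "
        "FROM sales GROUP BY product_id"
    )
    combined = {}
    for conn in sources:
        # Each source aggregates locally; only ~one row per product
        # travels over the network instead of every raw sales row.
        for product_id, partial_total in conn.execute(pushed_down):
            # SUM is decomposable, so partial sums can simply be added.
            combined[product_id] = combined.get(product_id, 0) + partial_total
    return combined
```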
Move Processing to the Data: Example 1
Obtain total sales by product

Naive strategy: the raw rows are shipped to the DV server and aggregated there: 350M rows moved through the network.
Denodo strategy (move processing to the data): the aggregation is pushed down to the source: 30K rows moved through the network.
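The difference between the two strategies is simply which query each system is asked to execute. A hedged sketch of the two plans, assuming the sales facts live in the data warehouse and the product data in a separate database (all table and column names are invented; the row counts are those quoted above):

```python
# Hypothetical query plans for "total sales by product". Table and column
# names are made up; the row counts are the ones quoted on the slides.

# Naive strategy: ship every raw sales row to the DV server and aggregate there.
naive_plan = {
    "data_warehouse": "SELECT product_id, amount FROM sales",          # ~350M rows moved
    "products_db":    "SELECT product_id, product_name FROM product",
    "dv_server":      "join the two result sets, then GROUP BY product_id",
}

# Move processing to the data: push the GROUP BY into the warehouse so only
# one pre-aggregated row per product crosses the network.
pushdown_plan = {
    "data_warehouse": ("SELECT product_id, SUM(amount) AS total_sales "
                       "FROM sales GROUP BY product_id"),              # ~30K rows moved
    "products_db":    "SELECT product_id, product_name FROM product",
    "dv_server":      "join the two small result sets on product_id",
}
```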
Move Processing to the Data: Example 2
Maximum sales discount by product in the last year: on-the-fly data movement

Move the products data to a temporary table in the DW: 20K rows moved through the network, plus 10K rows inserted in the DW.
Then execute the full query in the DW: 10K result rows come back through the network.
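A minimal sketch of this on-the-fly data movement strategy, assuming DB-API-style connections, an invented schema, SQLite-flavored SQL, and "discount" taken as list price minus sale price purely for illustration:

```python
# Illustrative sketch of on-the-fly data movement (invented schema, DB-API
# style connections such as sqlite3). The small side of the join is copied
# into the DW so the whole query can execute there.

def max_discount_by_product(products_conn, dw_conn):
    # 1. Read the small products table (~10K rows) from the products database.
    rows = products_conn.execute(
        "SELECT product_id, list_price FROM product"
    ).fetchall()

    # 2. Create a temporary table in the data warehouse and load those rows
    #    into it (~10K rows sent over the network and inserted in the DW).
    dw_conn.execute(
        "CREATE TEMP TABLE tmp_product (product_id INT, list_price REAL)"
    )
    dw_conn.executemany("INSERT INTO tmp_product VALUES (?, ?)", rows)

    # 3. Run the full join + aggregation inside the DW; only the ~10K result
    #    rows (one per product) come back over the network.
    return dw_conn.execute(
        "SELECT s.product_id, MAX(p.list_price - s.sale_price) AS max_discount "
        "FROM sales s JOIN tmp_product p ON s.product_id = p.product_id "
        "WHERE s.sale_date >= DATE('now', '-1 year') "
        "GROUP BY s.product_id"
    ).fetchall()
```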
Move Processing to the Data: Example 2
Maximum sales discount by product in the last year: partial aggregation pushdown

Products DB: 10K rows moved through the network.
Data warehouse: rows moved through the network = 10K × average number of sale prices per product.
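The partial aggregation pushdown alternative can be sketched the same way: the warehouse collapses sales to one row per product and distinct sale price, and the virtualization layer joins with the products data and finishes the MAX. Again, the schema and the discount definition are assumptions for illustration:

```python
# Illustrative sketch of partial aggregation pushdown (invented schema,
# SQLite-flavored SQL). The DW collapses sales to one row per
# (product_id, sale_price); the DV layer finishes the MAX(discount).

def max_discount_by_product(products_conn, dw_conn):
    # List prices from the products database (~10K rows over the network).
    list_price = dict(products_conn.execute(
        "SELECT product_id, list_price FROM product"
    ))

    # Partial aggregation pushed to the DW: one row per product and distinct
    # sale price, i.e. ~10K * avg #sale_prices_per_product rows moved.
    partial = dw_conn.execute(
        "SELECT product_id, sale_price "
        "FROM sales "
        "WHERE sale_date >= DATE('now', '-1 year') "
        "GROUP BY product_id, sale_price"
    )

    # Final aggregation in the virtualization layer.
    max_discount = {}
    for product_id, sale_price in partial:
        discount = list_price[product_id] - sale_price
        if discount > max_discount.get(product_id, float("-inf")):
            max_discount[product_id] = discount
    return max_discount
```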
Performance: Choosing the Best Execution Plan
How to Choose the Best Execution Plan?
Cost-Based Optimization in Data Virtualization

The optimizer must take into account:
- Data statistics, to estimate the size of intermediate result sets
- Data source indexes (and other physical structures)
- The execution model of each data source: e.g. parallel databases vs. Hadoop clusters vs. relational databases
- The capabilities of each data source (e.g. the number of processing cores in a parallel database or Hadoop cluster)
- Data transfer rates
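As an illustration of how these inputs might feed a cost model (this is not Denodo's optimizer; every field and formula here is a simplifying assumption), a candidate plan's cost can be estimated from per-source processing and transfer estimates, and the cheapest plan chosen:

```python
# Highly simplified, hypothetical cost model: estimate each candidate plan's
# cost as local processing time at the sources plus network transfer time,
# then pick the cheapest plan. Real optimizers consider far more factors;
# this only illustrates the inputs listed above.

from dataclasses import dataclass

@dataclass
class SourceStep:
    rows_processed: int           # estimated from data statistics / indexes
    rows_transferred: int         # rows shipped to the DV server
    rows_per_sec: float           # depends on the source's execution model (MPP, Hadoop, RDBMS)
    transfer_rows_per_sec: float  # network transfer rate to the DV server

def plan_cost(steps):
    # Sources work in parallel, so local cost is driven by the slowest step;
    # every transferred row must cross the network to the DV server.
    local = max(s.rows_processed / s.rows_per_sec for s in steps)
    transfer = sum(s.rows_transferred / s.transfer_rows_per_sec for s in steps)
    return local + transfer

def choose_plan(candidate_plans):
    # candidate_plans: dict mapping plan name -> list of SourceStep
    return min(candidate_plans, key=lambda name: plan_cost(candidate_plans[name]))
```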
Example Scenario: The Numbers
Example Scenario: The Numbers
Best Performance Even When Processing Billions of Rows

Performance comparison of physical vs. logical.
Scenario: Big Data volumes, TPC-DS benchmark.
- Sales (Netezza): 290M
- Customers (Oracle): 2M
- Items (SQL Server): 400K
Example Scenario: The Numbers
Physical vs. Logical DW Performance

Query description | Rows returned | Avg time, physical (all data in Netezza) | Avg time, logical | Optimization technique (automatically chosen by Denodo 6.0)
Total sales by customer | 1.99 M | 20,975 ms | 21,457 ms | Full GROUP BY pushdown
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52,313 ms | 59,060 ms | Full GROUP BY pushdown
Total sales by item brand | 31.35 K | 4,697 ms | 5,330 ms | Partial GROUP BY pushdown
Total sales by item where sale price is less than the current list price | 17.05 K | 3,509 ms | 5,229 ms | On-the-fly data movement
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying and microfilm, without the prior written authorization of Denodo Technologies.
Find more details at: datavirtualization.blog
http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
