This presentation explains in detail what a Data Lake architecture looks like and how data virtualization fits into the Logical Data Lake, and it covers some performance tips. It also includes an example demonstrating this model's performance.
This presentation is part of the Fast Data Strategy Conference; you can watch the video at goo.gl/9Jwfu6.
3. Agenda
1. Data Lake Architecture
2. Data Virtualization in the Logical Data Lake
3. Performance: ‘Move Processing to the Data’
4. Performance: Choosing the Best Execution Plan
5. Example Scenario: The Numbers
5. 5
Architecture of the Data Lake
[Figure: data lake architecture diagram. Labels recoverable from the slide, grouped by apparent layer:
Data sources: Big Data (Sensor Data, Machine Data (Logs), Social Data, Clickstream Data, Internet Data, Image and Video, Enterprise Content (Unstructured)); Traditional Enterprise Data (Enterprise Applications); Cloud (Cloud Applications).
Ingestion: ETL, CDC, Sqoop, (Flume, Kafka, …); Batch; Real-Time Data Access (On-Demand / Streaming).
Storage and processing: Hadoop (HDFS, YARN / Workload Management, MapReduce, Tez, Hive, Spark, Drill, Impala, Storm, HBase, Solr, Hunk; DW, Streams, NoSQL, Search, SQL); NoSQL; EDW; In-Memory (SAP Hana, …); Analytical Appliances; Cloud DW (Redshift, …); ODS; Data Warehouse.
Consumption: Real-Time Decision Management, Alerts, Scorecards, Dashboards, Reporting, Data Discovery, Self-Service, Search, Predictive Analytics, Statistical Analytics (R), Text Analytics, Data Mining.
Cross-cutting: Metadata Management, Data Governance, Data Security.]
6. 6
The Logical Data Lake Architecture
Integrated view of a plurality of systems: Hadoop, EDW, streaming, in-memory, ...
Questions for the Logical Data Lake:
How can I combine data from several systems while ensuring good performance?
How can I insulate consuming applications from technology change and evolving requirements?
How can I enforce consistent security and governance policies across the Data Lake?
8. 8
Architecture of the Data Lake
[Figure: the same data lake architecture diagram as on slide 5.]
9. 9
Architecture of the Logical Data Lake
[Figure: the data lake architecture diagram from slide 5 with an added Data Virtualization layer. Labels in that layer: Real-Time Data Access (On-Demand / Streaming), Data Caching, Data Services, Data Search & Discovery, Governance, Security, Optimization, Data Abstraction, Data Transformation, Data Federation.]
10. 10
What is Needed?
Requirements for the Integration Component in the Logical Data Lake
Denodo Data Virtualization is the only option that meets all of the following:
Ability to answer ad-hoc queries combining data from several systems
Performance comparable to physical approaches
Ability to expose different logical views over the same data
Single entry point to apply security and governance policies
Comprehensive, granular security support
12. 12
Move Processing to the Data
Process the data where it resides
Process the data locally where it resides
DV system combines partial results
Minimizes network traffic
Leverages specialized data sources
13. 13
Move Processing to the Data: Example 1
Obtain Total Sales By Product (Naive Strategy)
Naive Strategy:
350M rows moved through the network
14. 14
Move Processing to the Data: Example 1
Obtain Total Sales By Product (Move Processing to the Data)
Denodo Strategy:
30K rows moved through the network
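As a rough illustration of the difference between the two strategies in Example 1, the sketch below uses two in-memory SQLite databases as stand-ins for the data warehouse and the products database. The table names, column names and row counts are assumptions made for the example; they are not taken from the slides, and this is not Denodo's implementation.

```python
import sqlite3


def make_sources(num_products=10, sales_per_product=100):
    """Build two independent in-memory databases standing in for the sources."""
    dw = sqlite3.connect(":memory:")  # hypothetical data warehouse
    dw.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
    dw.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(p, 1.0 + s) for p in range(num_products) for s in range(sales_per_product)],
    )
    products_db = sqlite3.connect(":memory:")  # hypothetical products database
    products_db.execute("CREATE TABLE product (product_id INTEGER, name TEXT)")
    products_db.executemany(
        "INSERT INTO product VALUES (?, ?)",
        [(p, f"product-{p}") for p in range(num_products)],
    )
    return dw, products_db


def naive_total_sales(dw, products_db):
    """Pull every row from both sources, then join and aggregate in the DV layer."""
    sales = dw.execute("SELECT product_id, amount FROM sales").fetchall()
    products = products_db.execute("SELECT product_id, name FROM product").fetchall()
    names = dict(products)
    totals = {}
    for product_id, amount in sales:
        name = names[product_id]
        totals[name] = totals.get(name, 0.0) + amount
    return totals, len(sales) + len(products)


def pushdown_total_sales(dw, products_db):
    """Let the warehouse aggregate; only one row per product crosses the network."""
    partial = dw.execute(
        "SELECT product_id, SUM(amount) FROM sales GROUP BY product_id"
    ).fetchall()
    products = products_db.execute("SELECT product_id, name FROM product").fetchall()
    names = dict(products)
    totals = {names[product_id]: total for product_id, total in partial}
    return totals, len(partial) + len(products)


dw, products_db = make_sources()
naive_totals, naive_rows = naive_total_sales(dw, products_db)
pushed_totals, pushed_rows = pushdown_total_sales(dw, products_db)
print("rows moved (naive):", naive_rows)
print("rows moved (pushdown):", pushed_rows)
```

With these toy sizes the naive strategy moves 1,010 rows and the pushdown strategy 20, while both return the same totals. The gap grows with the number of sales rows per product, which is the effect the 350M versus 30K figures on the slides describe.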
15. 15
Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: On-the-fly Data Movement
Move Products data to a temp table in the DW:
20K rows moved through the network + 10K rows inserted in the DW
Execute full query on the DW:
10K rows through the network
16. 16
Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: Partial aggregation Pushdown
Products DB:
10K rows through the network
Data Warehouse:
#rows through the network = 10K * average #sale_prices_per_product
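The two strategies for Example 2 can be compared with simple arithmetic on the row counts quoted on the slides. The sketch below is only a back-of-the-envelope comparison: it counts rows crossing the network, ignores the cost of inserting rows into the temp table, and the function names and the sample averages are assumptions made for illustration.

```python
# Row counts quoted on the slides for Example 2.
PRODUCT_ROWS = 10_000           # rows read from the Products DB
TEMP_TABLE_TRANSFER = 20_000    # rows moved to load the temp table in the DW
FULL_QUERY_RESULT = 10_000      # rows returned by the full query on the DW


def on_the_fly_rows():
    """Network rows for on-the-fly data movement."""
    return TEMP_TABLE_TRANSFER + FULL_QUERY_RESULT


def partial_pushdown_rows(avg_sale_prices_per_product):
    """Network rows for partial aggregation pushdown."""
    return PRODUCT_ROWS + PRODUCT_ROWS * avg_sale_prices_per_product


def cheaper_strategy(avg_sale_prices_per_product):
    """Pick the strategy that moves fewer rows (ties go to data movement)."""
    if partial_pushdown_rows(avg_sale_prices_per_product) < on_the_fly_rows():
        return "partial aggregation pushdown"
    return "on-the-fly data movement"


# Hypothetical averages, chosen only to show both outcomes.
for avg in (1.5, 5):
    print(avg, partial_pushdown_rows(avg), on_the_fly_rows(), cheaper_strategy(avg))
```

Under these figures, partial aggregation pushdown moves fewer rows only when a product has fewer than two distinct sale prices on average; above that, moving the products table to the DW is cheaper in network terms.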
18. 18
How to Choose the Best Execution Plan?
Cost-Based Optimization in Data Virtualization
Must take into account:
Data statistics to estimate the size of intermediate result sets
Data source indexes (and other physical structures)
Execution model of data sources: e.g. parallel databases vs. Hadoop clusters vs. relational databases
Features of data sources (e.g. number of processing cores in a parallel database or Hadoop cluster)
Data transfer rate
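A toy cost model can show how these factors interact. The sketch below is a deliberate simplification and not Denodo's optimizer: each candidate plan carries an estimated number of transferred rows and an estimated processing time inside the sources, and the plan with the lowest total is chosen. The `Plan` class, the cost formula and the processing times are assumptions; only the 350M and 30K row counts come from the earlier slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rows_transferred: int   # estimated from data statistics
    source_seconds: float   # estimated work done inside the data sources


def estimated_cost(plan, rows_per_second):
    """Total estimated seconds: source processing plus network transfer."""
    return plan.source_seconds + plan.rows_transferred / rows_per_second


def choose_plan(plans, rows_per_second):
    """Return the candidate plan with the lowest estimated cost."""
    return min(plans, key=lambda plan: estimated_cost(plan, rows_per_second))


candidates = [
    Plan("naive", rows_transferred=350_000_000, source_seconds=1.0),   # hypothetical time
    Plan("pushdown", rows_transferred=30_000, source_seconds=20.0),    # hypothetical time
]

# Hypothetical transfer rate of one million rows per second.
best = choose_plan(candidates, rows_per_second=1_000_000)
print(best.name, round(estimated_cost(best, 1_000_000), 2))
```

In this model the transfer rate decides the outcome: at one million rows per second the pushdown plan wins easily, while an unrealistically fast link would make the naive plan look cheaper because it does less work in the source.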
20. 20
Example Scenario: The Numbers
Best Performance Even When Processing Billions of Rows
Performance Comparison of Physical vs Logical
Scenario
Big Data volumes
TPC-DS benchmark
Sales (Netezza): 290M rows
Customers (Oracle): 2M rows
Items (SQLServer): 400K rows
21. 21
Example Scenario: The Numbers
Physical vs Logical DW Performance
Query Description | Rows Returned | AVG Time Physical (all data in Netezza) | AVG Time Logical | Optimization Technique (automatically chosen by Denodo 6.0)
Total sales by customer | 1.99 M | 20975 ms | 21457 ms | Full group by pushdown
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52313 ms | 59060 ms | Full group by pushdown
Total sales by item brand | 31.35 K | 4697 ms | 5330 ms | Partial group by pushdown
Total sales by item where sale price less than current list price | 17.05 K | 3509 ms | 5229 ms | On the fly data movement
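To put the timings in the table above side by side, the short sketch below computes how much slower the logical scenario was than the physical one for each query. The timings are copied from the table; the function name and the choice of relative overhead as the metric are assumptions for illustration.

```python
# (query, avg physical ms, avg logical ms), copied from the table above.
timings = [
    ("Total sales by customer", 20975, 21457),
    ("Total sales by customer and year between 2000 and 2004", 52313, 59060),
    ("Total sales by item brand", 4697, 5330),
    ("Total sales by item where sale price less than current list price", 3509, 5229),
]


def overhead_percent(physical_ms, logical_ms):
    """Extra time of the logical scenario relative to the physical one, in percent."""
    return 100.0 * (logical_ms - physical_ms) / physical_ms


overheads = {
    query: overhead_percent(physical_ms, logical_ms)
    for query, physical_ms, logical_ms in timings
}

for query, pct in overheads.items():
    print(f"{query}: logical is {pct:.1f}% slower")
```

The relative overhead ranges from about 2% for the first query to about 49% for the last one, although in absolute terms the largest difference is under seven seconds.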