Watch full webinar here: https://bit.ly/3KMRTEV
With the appearance of cloud object storage services like AWS S3 or Azure ADLS, the data lake has seen an upturn in usage as some of the challenges of the original idea were addressed. However, companies across the globe still find it challenging to adopt data lakes into the corporate data ecosystem. Although storage capacity is almost unlimited, retrieving data from these sources and integrating it with the corporate ecosystem is still an arduous task for data engineers. As a result, data lakes become either a silo or a secondary form of storage instead of feeding business processes and creating value.
Join us in this session with Antonio Tortosa, a Technical Consultant at Denodo, who will discuss how the integration power of a logical data fabric working in tandem with the processing power of an MPP engine allows businesses to unravel their existing data lakes and seamlessly adopt them into the corporate data ecosystem. By the end of the session, you will have an understanding of how the Denodo Platform can help connect, explore, and integrate data lakes into a logical data fabric.
Topics covered in this session:
- Why Massively Parallel Processing.
- How Denodo leverages the computational power of MPP engines.
- How Denodo and Presto can be used to process and integrate disparate data, including that of a data lake.
Unraveling the Data Lake: MPP integration within a Logical Data Fabric
1. Unraveling the Data Lake. MPP integration within a Logical Data Fabric
Antonio Tortosa
Technical Consultant | Denodo
2. AGENDA
1. The challenge of cloud object storage
2. Incorporating Massively Parallel Processing engines into a logical data fabric
3. Denodo Platform and Presto
4.
The simplified version of object storage
The challenge of cloud object storage
Source: Amazon S3. How it works
■ Cheap storage for backup, old or rarely used data
■ Ingest 3rd party data
■ Move non-critical workloads to cheaper systems
■ Data science playground
5.
The reality of enterprise data strategy
[Diagram labels: On-prem data, CDC, ETL, Data Lake / Object Storage, Enterprise Data Warehouse, Business Intelligence, Reporting, Data Discovery, Other Apps]
6.
The missing pieces
Processing - An engine capable of effectively processing the data stored
However, an MPP engine alone is not enough, as seen in the failures of previous incarnations of Data Lake projects
Integration - A logical model serving a common canonical view of the data ecosystem
Data in the object storage is just a portion of the data in the organization. All data should be managed with consistency, regardless of location
Data Management - Fine-grained Security & Data Governance
Ease of data discovery. Documentation, classification and search capabilities.
Fine-grained security and access control
8.
Parallel Processing of object storage data
Incorporating MPP into a logical data fabric
[Diagram: Logical Layer, MPP Coordinator, four MPP Workers, Object Storage. Legend: Data query, Data flow, Other calls]
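The coordinator/worker layout on this slide can be illustrated with a small sketch. It is not Presto code: the in-memory `OBJECT_STORAGE` dictionary, the file keys, and the `worker_scan` and `coordinator_query` names are all hypothetical stand-ins, and threads play the role of worker nodes. The only point is the shape of the computation: each worker scans its own share of the files, and the coordinator merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for object storage: each key is one file
# holding (store_id, amount) rows.
OBJECT_STORAGE = {
    "sales/part-0": [(1, 10.0), (2, 5.0)],
    "sales/part-1": [(1, 2.5), (3, 7.0)],
    "sales/part-2": [(2, 1.0), (3, 3.0)],
    "sales/part-3": [(1, 4.0)],
}


def worker_scan(key):
    """One worker reads one file and returns a partial sum per store."""
    partial = {}
    for store_id, amount in OBJECT_STORAGE[key]:
        partial[store_id] = partial.get(store_id, 0.0) + amount
    return partial


def coordinator_query(keys, workers=4):
    """Fan the scan out to the workers, then merge their partial results."""
    totals = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(worker_scan, keys):
            for store_id, amount in partial.items():
                totals[store_id] = totals.get(store_id, 0.0) + amount
    return totals


totals = coordinator_query(sorted(OBJECT_STORAGE))
```

Because every file is aggregated independently, adding workers shortens the scan phase without changing the merged result.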
9.
Integration of object storage with the data ecosystem
[Diagram: Logical Layer, MPP Coordinator, Other Sources, four MPP Workers, Object Storage. Legend: Data query, Data flow, Other calls]
11.
The process at a glance
Denodo Platform and Presto
Introspection
Denodo can connect to the object storage, for example S3 buckets, and graphically browse the folders and files. Parquet files, folders with content, and partitions are automatically detected, and the developers choose the ones that will become Denodo base views.
Mapping
Denodo connects to Presto and creates the necessary structures to map the object storage files in the target schema. Denodo automatically detects field data types and partitions. Denodo then creates base views from these tables.
Execution
At execution time, Denodo sends the query to Presto, which now has the object storage files natively mapped. In addition, if other Denodo data sources have to be used, Presto uses its Denodo connector to pull that data into the worker nodes' memory in real time.
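To make the mapping step more concrete, here is a sketch of the kind of statement that maps existing Parquet files to a table in Presto's Hive connector. The slides do not show the DDL that Denodo actually generates, so this is an assumption about its general shape. `external_location`, `format`, and `partitioned_by` are Hive connector table properties; the helper name `external_table_ddl`, the `hive.lake` catalog and schema, the bucket, and the column names are hypothetical.

```python
def external_table_ddl(qualified_name, columns, partitions, location):
    """Build a CREATE TABLE statement over files that already exist.

    columns and partitions are lists of (name, type) pairs. In the Hive
    connector, partition columns are declared last in the column list.
    """
    cols = ",\n  ".join(f"{name} {type_}" for name, type_ in columns + partitions)
    props = [f"external_location = '{location}'", "format = 'PARQUET'"]
    if partitions:
        names = ", ".join(f"'{name}'" for name, _ in partitions)
        props.append(f"partitioned_by = ARRAY[{names}]")
    return (
        f"CREATE TABLE {qualified_name} (\n  {cols}\n)\n"
        "WITH (\n  " + ",\n  ".join(props) + "\n)"
    )


# Hypothetical table definition for illustration only.
ddl = external_table_ddl(
    "hive.lake.store_sales",
    columns=[("ss_item_sk", "bigint"), ("ss_net_paid", "double")],
    partitions=[("ss_sold_year", "integer")],
    location="s3://example-bucket/store_sales/",
)
print(ddl)
```

Writing such statements by hand for every folder is the manual work that the introspection and mapping steps are meant to remove.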
12.
Introspection of Object Storage
[Diagram: three MPP Workers, Object Storage]
■ The MPP Workers need to have the object storage files mapped internally as tables.
○ This is typically done manually by data engineers and requires separate tools to navigate the object storage and to create the tables in the MPP engine.
■ Denodo simplifies this process by providing a unified point of view.
○ The same tool that allows introspection into the object storage manages the mapping of the files to tables.
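As an illustration of what detecting partitions during introspection can mean, the sketch below groups object keys by table folder and collects partition columns. It assumes a Hive-style `column=value` folder layout, which the slides do not state, and it does not reproduce Denodo's detection logic. The function name `detect_partitions` and the sample keys are hypothetical.

```python
def detect_partitions(keys, suffix=".parquet"):
    """Group Parquet object keys by table folder and list partition columns.

    Assumes a Hive-style layout: <table>/<col>=<value>/.../<file>.parquet
    """
    tables = {}
    for key in keys:
        if not key.endswith(suffix):
            continue  # skip non-Parquet objects such as _SUCCESS markers
        folders = key.split("/")[:-1]
        table = "/".join(f for f in folders if "=" not in f)
        entry = tables.setdefault(table, {"files": 0, "partitions": []})
        entry["files"] += 1
        for folder in folders:
            if "=" in folder:
                column = folder.split("=", 1)[0]
                if column not in entry["partitions"]:
                    entry["partitions"].append(column)
    return tables


# Hypothetical bucket listing for illustration only.
found = detect_partitions([
    "lake/store_sales/year=2022/month=01/part-0.parquet",
    "lake/store_sales/year=2022/month=02/part-0.parquet",
    "lake/store_sales/_SUCCESS",
    "lake/customer/part-0.parquet",
])
```

Each entry in `found` corresponds to a folder that a developer could choose to expose as a base view.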
15.
Execution - Presto with other sources
[Diagram: Logical Layer, MPP Coordinator, Other Sources, four MPP Workers, Object Storage. Legend: SQL query, Data flow, Other calls]
16.
Execution - Presto with other sources
[Figure labels: 90,859 rows; 2,880,404 rows]
17.
Execution - Presto with other sources
Fully delegated query to Presto
■ customer_crm is brought into memory in real time to the worker nodes, so it is locally referenced as tmp_231_885_0_2549
■ store_sales was previously mapped by Denodo Platform and is referenced in this query as vdp_table_167509671357
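The idea of colocating a table from another source with the mapped object storage table can be sketched as a broadcast-style hash join. This is an assumption made for illustration: the slides do not say which join strategy Presto chooses. The function name `broadcast_hash_join`, the column names, and the sample rows are hypothetical; only the table names customer_crm and store_sales come from the slide.

```python
def broadcast_hash_join(small_rows, big_partitions, key):
    """Join a small in-memory table against a partitioned large table.

    small_rows: list of dicts, the table pulled in from another source.
    big_partitions: list of lists of dicts, one list per worker's share.
    """
    # Build the hash table once; every worker would receive a copy of it.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in big_partitions:  # each worker probes its own share
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical rows for illustration only.
customer_crm = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Luis"},
]
store_sales_partitions = [
    [{"customer_id": 1, "amount": 10.0}],
    [{"customer_id": 2, "amount": 5.0}, {"customer_id": 3, "amount": 1.0}],
]
rows = broadcast_hash_join(customer_crm, store_sales_partitions, "customer_id")
```

Only the smaller table travels over the network in this pattern; the large table is read in place by the workers that already hold its partitions.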
18.
Enterprise Data Architecture
Final Solution
[Diagram labels: On-prem data, CDC, ETL, Data Lake / Object Storage, Enterprise Data Warehouse, Denodo Virtual DataPort, Denodo Web Services, Denodo MPP, Denodo Data Catalog, Business Intelligence, Reporting, Other Apps]
19. CLOSING REMARKS
▪ There is a renewed interest in Data Lakes thanks to cloud object storage solving some of the original drawbacks
▪ However, they are not, on their own, a complete solution for a realistic enterprise data strategy. In addition, we must also consider:
○ Cost-effective processing
○ Integration with other sources within the enterprise data ecosystem
○ Data governance and data security
▪ Denodo Platform has always provided these. Yet, in its 2023Q1 update, the state-of-the-art integration with Presto provides an even better solution to the integration of Data Lakes into a logical data fabric.
▪ Now, Denodo customers can, from a unified access layer,
○ Introspect object storage files
○ Integrate object storage data with other corporate sources to increase data adoption
▪ With this, Denodo Platform will seamlessly colocate the data in the Presto MPP cluster to accelerate query execution.