Watch full webinar here: https://bit.ly/3JBpwGm
Data lakes and data warehouses offer organizations a centralized data delivery platform. In the recent report Building the Unified Data Warehouse and Data Lake by leading industry analyst firm TDWI, 64% of organizations stated that the objective of unifying the data warehouse and data lake is to get more business value, and 84% of those polled felt that a unified approach to data warehouses and data lakes was either extremely or moderately important.
In the recent report Logical Data Fabric to the Rescue: Integrating Data Warehouses, Data Lakes, and Data Hubs by Rick van der Lans, we also discovered the importance of "time to insight and speed".
During this webinar, we will discuss how a logical data fabric not only gives organizations a holistic view of their data across multiple data lakes, data warehouses, and other data sources, but also improves time to value.
Catch this on-demand session & learn:
- How a Logical Data Fabric is the right approach to help organizations unify their data.
- The advanced features of a Logical Data Fabric that optimize your queries irrespective of source, whether the data is in a data lake, a data warehouse, or elsewhere.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Agenda
DENODO LUNCH AND LEARN ASEAN
1. What is a Data Lake?
2. Why Do They Exist?
3. Some of the Challenges of Data Lakes
4. The Benefits of a Logical Approach to Data Lakes
5. Customer Case Study
6. Demo
7. Conclusion
8. Q&A
9. Next Steps
Etymology of “Data Lake”
Pentaho’s CTO James Dixon is credited with coining the
term "data lake". He described it in his blog in 2010:
"If you think of a data mart as a store of bottled water – cleansed
and packaged and structured for easy consumption – the data
lake is a large body of water in a more natural state. The contents
of the data lake stream in from a source to fill the lake, and
various users of the lake can come to examine, dive in, or take
samples."
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Data lakes were born to address the challenge of cost reduction: they allow cheap, efficient storage of very large amounts of data. Cloud implementations further simplified the complexity of managing a large data lake.
The Data Lake – Architecture I
Distributed File System
Cheap storage for large data volumes
• Support for multiple file formats (Parquet, CSV, JSON, etc.)
• Examples:
• On-prem: HDFS
• Cloud native: AWS S3, Azure ADLS, Google GCS
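To make the multi-format point concrete, here is a minimal stdlib-only sketch (sample records invented for illustration; Parquet itself would need a library such as pyarrow, so only CSV and JSON are shown):

```python
import csv, io, json

# A small sample dataset; in a real lake these records might land as
# Parquet files, but CSV and JSON can be demonstrated with the stdlib.
rows = [
    {"id": 1, "sensor": "temp", "value": 21.5},
    {"id": 2, "sensor": "temp", "value": 22.1},
]

# CSV: compact text, but column types are not preserved
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "sensor", "value"])
writer.writeheader()
writer.writerows(rows)

# JSON: nested structures and numeric types survive a round trip
json_text = json.dumps(rows)

# Reading back: JSON keeps numbers as numbers, CSV yields strings
assert json.loads(json_text)[0]["value"] == 21.5
parsed_csv = list(csv.DictReader(io.StringIO(csv_buf.getvalue())))
assert parsed_csv[0]["value"] == "21.5"   # CSV loses type information
```

This is why columnar, typed formats like Parquet are preferred for analytical zones, while CSV and JSON remain common at ingestion.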
The Data Lake – Architecture II
Distributed File System
Execution Engine
Massively parallel & scalable
execution engine
• Cheaper execution than traditional EDW
architectures
• Decoupled from storage
• Doesn’t require specialized HW
• Examples:
• SQL-on-Hadoop engines: Spark, Hive, Impala,
Drill, Dremio, Presto, etc.
• Cloud native: AWS Redshift, Snowflake, AWS
Athena, Delta Lake, GCP BigQuery
The Data Lake – Architecture III
Adoption of new transformation
techniques
• Data ingested is normally raw and unusable by end
users
• Data is transformed and moved to different “zones”
with different levels of curation
• End users only access the refined zone
• Use of ELT as a cheaper transformation technique
than ETL
• Use of the engine and storage of the lake for data
transformation instead of external ETL flows
• Removes the need for additional staging HW
[Diagram: Raw zone → Trusted zone → Refined zone, layered on top of the execution engine and distributed file system]
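The zone-promotion flow above can be sketched in a few lines (hypothetical zone names and data-quality rules; a real lake would run this as ELT inside the lake's own engine rather than in Python):

```python
# Raw zone: data lands as-is, including unusable records
raw_zone = [
    {"order_id": "A1", "amount": "100.0", "country": "SG"},
    {"order_id": "A2", "amount": "bad",   "country": "MY"},  # bad record
    {"order_id": "A3", "amount": "250.5", "country": "SG"},
]

def to_trusted(records):
    """Apply a data-quality rule: keep only rows with a numeric amount."""
    trusted = []
    for r in records:
        try:
            trusted.append({**r, "amount": float(r["amount"])})
        except ValueError:
            continue  # in practice this row would be quarantined, not dropped
    return trusted

def to_refined(records):
    """Curate for end users: aggregate revenue per country."""
    totals = {}
    for r in records:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

trusted_zone = to_trusted(raw_zone)
refined_zone = to_refined(trusted_zone)
assert refined_zone == {"SG": 350.5}
```

End users would only see `refined_zone`; the transformation runs on the lake's storage and compute, which is what makes ELT cheaper than staging data through an external ETL server.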
Data Lake Example – AWS
§ Data ingested using AWS Glue (or other ETL tools)
§ Raw data stored in S3 object store
§ Maintain fidelity and structure of data
§ Metadata extracted/enriched using Glue Data
Catalog
§ Business rules/DQ rules applied to S3 data as
copied to Trusted Zone data stores
§ Trusted Zone contains more than one data store –
select best data store for data and data processing
§ Refined Zone contains data for consumer – curated
data sets (data marts?)
§ Refined Zone data stores differ – Redshift, Athena,
Snowflake, …
[Diagram: internal and external data sources feed an ingestion layer (AWS Glue) into the Raw Zone (S3 for raw data), then the Trusted Zone, then the Refined Zone, which serves consumers such as data portals, BI/visualization, analytic workbenches, mobile apps, etc.]
Hadoop-Based Data Lakes – A Data Scientist’s Playground
§ The early data scientists saw Hadoop as their
personal supercomputer.
§ Hadoop-based Data Lakes helped democratize access to state-of-the-art supercomputing with off-the-shelf HW (and later the cloud)
§ The industry push for BI made Hadoop–based
solutions the standard to bring modern analytics to
any corporation.
Hadoop-based Data Lakes became
“data science silos”
Can data lakes also address
the other data management
challenges?
Can they provide fast
decision making with proper
governance and security?
Changing the Data Lake Goals
“The popular view is that a
data lake will be the one
destination for all the data
in their enterprise and the
optimal platform for all
their analytics.”
Nick Heudecker, Gartner
Rick van der Lans, R20 Consultancy
Multi-purpose data lakes are data delivery environments
developed to support a broad range of users, from traditional
self-service BI users (e.g. finance, marketing, human resources,
transport) to sophisticated data scientists.
Multi-purpose data lakes allow a broader and deeper use of the
data lake investment without minimizing the potential value for
data science and without making it an inflexible environment.
The Data Lake as the Repository of All Data
Is that realistic? Even if it is possible, it comes with multiple trade-offs:
COST
• Huge up-front investment: creating ingestion pipelines for all company datasets into the lake is costly.
• Large recurrent maintenance costs: those pipelines need to be constantly modified as data structures change in the sources.
GOVERNANCE
• Risk of inconsistencies: data needs to be frequently synchronized to avoid stale datasets.
• Loss of capabilities: data lake capabilities may differ from those of the original sources, e.g. quick access by ID in an operational RDBMS.
Efficient use of the data lake to accelerate insights comes at the cost of price, time-to-market, and governance.
Purpose-specific Data Lakes
Restricting the use of the data lake to a specific use case (e.g. data science) creates an environment with multiple purpose-specific systems, which slows down TTM and jeopardizes security and governance:
• Higher complexity: end users need to find where data is and how to use it.
• Risk of inconsistencies: data may be in multiple places, in different formats, and calculated at different times.
• Loss of security: frustration increases the use of shadow IT, "personal" extracts, uncontrolled data prep flows, etc.
Data Lakes in the ‘Pit of Despair’
Data Lakes are 2 to 5 years from the Plateau of Productivity and are deep in the Trough of Disillusionment.
Gartner – Hype Cycle Data Management July 2021
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018
[Diagram: Gartner's Logical Data Warehouse architecture, with data virtualization highlighted]
"…Data lakes lack semantic consistency and governed metadata. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency, and access controls."
How can a logical data
fabric approach help?
Faster Time-to-Market for Data Projects
Why?
• The Data Virtualization Platform allows you to connect directly to all kinds of data sources (EDW, application
databases, SaaS applications, etc.)
• Thus not all data needs to be replicated to the data lake for consumers to access it from a single (virtual)
repository.
• In some cases it makes sense to replicate data in the lake; in others it doesn't. Data virtualization opens that door.
Capabilities
• Data can be accessed immediately, easily improving TTM and ROI of the lake
• If data is not useful, time was not lost preparing pipelines and copying data
• Can ingest and synchronize data into the lake efficiently when needed
• Denodo can load and update data in the data lake natively, using Parquet and parallel loads
• Execution is pushed down to original sources, taking advantage of their capabilities
• Especially significant in the case of EDW with strong processing capabilities
Addresses: TTM, COST
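The pushdown idea can be illustrated with a toy example (sqlite3 stands in for a powerful source such as an EDW; the SQL a virtualization layer would generate is written by hand here):

```python
import sqlite3

# A stand-in "source" with its own processing capabilities
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (region TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("APAC", 100.0), ("EMEA", 50.0), ("APAC", 75.0)])

# Pushed-down query: the source filters and aggregates,
# so only a single row crosses the wire
pushed = src.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'APAC'").fetchone()[0]

# Naive alternative: pull every row and aggregate in the virtual layer
pulled = sum(amount
             for region, amount in src.execute("SELECT region, amount FROM sales")
             if region == "APAC")

assert pushed == pulled == 175.0
```

Both paths return the same answer, but the pushed-down version moves one row instead of the whole table, which is why pushdown matters most for sources with strong processing capabilities.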
Easier Self-Service through a Single Data Delivery Layer
Why?
• From an end user's perspective, access to all data is done through a single layer, regardless of data format and actual physical location.
• A single delivery layer also allows you to enforce security and governance policies
• The virtual layer becomes the “delivery zone” of the data lake, offering modeling and caching capabilities,
documentation and output in multiple formats
Capabilities
• Built-in rich modeling capabilities to tailor data models to end users
• Integrated catalog, search and documentation capabilities
• Access via SQL, REST, OData and GraphQL with no additional coding
• Advanced security controls, SSO, workload management, monitoring, etc.
Addresses: GOVERNANCE
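As an illustration of the single-delivery-layer idea, the sketch below serves one logical dataset in two output shapes (hypothetical function and sample data; a platform such as Denodo exposes SQL, REST, OData, and GraphQL natively rather than through a handwritten function):

```python
import json

# One logical dataset behind the delivery layer (invented sample data)
_CUSTOMERS = [
    {"id": 1, "name": "Acme", "country": "SG"},
    {"id": 2, "name": "Globex", "country": "MY"},
]

def delivery_layer(fmt="rows", country=None):
    """Single access point: same data, multiple output formats.

    fmt="rows" mimics a tabular (SQL-style) result set;
    fmt="json" mimics a REST-style payload.
    """
    data = [c for c in _CUSTOMERS if country is None or c["country"] == country]
    if fmt == "rows":
        return [(c["id"], c["name"], c["country"]) for c in data]
    if fmt == "json":
        return json.dumps(data)
    raise ValueError(f"unknown format: {fmt}")

# Same underlying data, two delivery shapes
assert delivery_layer("rows", country="SG") == [(1, "Acme", "SG")]
assert json.loads(delivery_layer("json"))[1]["name"] == "Globex"
```

Because every consumer goes through one entry point, security and governance policies (the `country` filter stands in for a row-level policy) are enforced in one place rather than per source.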
Accelerates Query Execution
Why?
Controlling data delivery separately from storage allows a virtual layer to accelerate query execution,
providing faster response than the sources alone.
Capabilities
• Aggregate-aware capabilities to accelerate execution of analytical queries
• Flexible caching options to materialize frequently used data:
• Full datasets
• Partial results
• Hybrid (cached content + updates from source in real time)
• Powerful optimization capabilities for multi-source federated queries
Addresses: PERFORMANCE
Denodo’s Logical Data Lake
[Diagram: the Logical Data Lake. IT storage and processing sources (ETL, data warehouse, Kafka, physical data lake, files) sit beneath a source abstraction layer; Denodo's query engine, business delivery layer, business catalog, and security and governance form the delivery zone, which serves consuming tools: BI & reporting, mobile applications, predictive analytics, AI/ML, and real-time dashboards.]
Case Study: Problem, Solution, Results
Leading Construction Manufacturer Improves
Service Delivery and Revenue
In business for over 90 years, this company is the world's leading manufacturer of construction and mining equipment, diesel and natural gas engines, industrial gas turbines, and diesel-electric locomotives.
Problem
§ Competitive pressure from low-cost Chinese manufacturers
§ Needed a proactive approach to customer service to differentiate
§ Sought to improve equipment and services delivery through predictive maintenance
Solution
§ Telemetry (IoT) data from sensors embedded in the equipment is stored in Hadoop to perform predictive analytics
§ Denodo integrates the analytics data with parts, maintenance, and dealer information stored in traditional systems
§ It then feeds the predictive maintenance information to a customer dashboard
Results
§ Phased rollout systematically improved asset performance and proactive maintenance
§ Increased revenue from the sale of services and parts
§ Reduced warranty costs from parts failures
§ Future: optimize pricing for services and parts among global service providers
Key Takeaways
1. In most cases, not all the data is going to be in the
data lake
2. Large data lake projects are complex environments
that will benefit from a virtual ‘consumption’ layer
3. Data virtualization provides the governance and
management infrastructure required for a successful
data lake implementation
4. Data virtualization is more than just a data access or
services layer; it is a key component of a data lake
Get Started Today
Try Denodo for a Test Drive with a 30-day
free trial in the cloud marketplaces
CHOICE
Under your cloud account
SUPPORT
Community forum AND remote sales
engineer
OPPORTUNITY
30 minutes free consultation with
Denodo Cloud specialist
denodo.link/drive22
Logical Data Fabric to the Rescue:
Integrating Data Warehouses,
Data Lakes, and Data Hubs
ACCESS YOUR REPORT
denodo.link/LDF21
Data Democratization with
a Logical Data Fabric
REGISTER NOW
denodo.link/FD22
APAC | April 27 | 9:30 am SGT