Watch full webinar here: https://bit.ly/3hgOSwm
Data Lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
1.
2. Data Lake Acceleration vs. Data
Virtualization
What’s the difference?
Pablo Alvarez-Yanez
Global Director of Product Management, Denodo
3. Data and Analytics
[Figure: quadrant chart plotting DATA (known/unknown) against QUESTIONS (known/unknown), with regions labeled Core, Foundational, Establishing Value, Expanding Understanding and Investigating, and Innovation and Exploration, and the workloads Data Warehouse, Data Lake, Data Science, and Operational Analytics placed on it]
• Data analysis comes in different sizes and flavors
• Transactional and analytical systems have traditionally provided data management for OLTP and OLAP use cases
• During the last decade, a myriad of new systems has evolved to cover additional scenarios: NoSQL, graphs, indexers, schema-on-read RDBMSs, etc.
• Additionally, new architectural paradigms, like data lakes, have flourished
• New techniques, like AI and ML, have also extended the methods available for data analysis
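The schema-on-write vs. schema-on-read distinction above can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor's engine: `sqlite3` stands in for a schema-on-write EDW, and raw JSON lines stand in for schema-on-read lake storage.

```python
import json
import sqlite3

# Schema-on-write (EDW style): the schema is enforced at load time,
# so malformed records are rejected before they reach storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL NOT NULL)")
conn.execute("INSERT INTO sales VALUES (1, 99.5)")

# Schema-on-read (data lake style): raw records are stored as-is,
# and a schema is projected only when the data is queried.
raw_records = ['{"id": 2, "amount": 10.0}', '{"id": 3}']  # heterogeneous

def read_with_schema(raw):
    """Apply a schema at read time; tolerate missing fields."""
    for line in raw:
        rec = json.loads(line)
        yield rec.get("id"), rec.get("amount")  # amount may be None

rows = list(read_with_schema(raw_records))
print(rows)  # [(2, 10.0), (3, None)]
```

The trade-off the slide describes falls out directly: the write-time schema gives the warehouse its query speed and consistency, while the read-time schema gives the lake its flexibility with less structured data.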
4. EDWs are expanding
• Data warehouse vendors are adding capabilities to expand their range:
  • Support for additional formats like JSON and Parquet files, to work as data lake engines
  • Embedded ML algorithms, to enable data science capabilities
[Figure: same quadrant chart, with the Data Warehouse footprint expanding]
5. Data Lakes are expanding
• Data lake vendors are also expanding:
  • They have added traditional EDW capabilities like ACID compliance
  • Many also embed ML capabilities
[Figure: same quadrant chart, with the Data Lake footprint expanding]
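The ACID claim for data lakes usually rests on a transaction log whose commits are published atomically, which is roughly the trick used by table formats such as Delta Lake, Apache Iceberg, and Apache Hudi. Below is a minimal Python sketch of that idea under stated assumptions: function names and the `_log` layout are illustrative, not any vendor's API.

```python
import json
import os
import tempfile

def commit_version(table_dir, version, data_files):
    """Publish one transaction: data files first, then one atomic commit."""
    log_dir = os.path.join(table_dir, "_log")
    os.makedirs(log_dir, exist_ok=True)
    # Stage the commit record in a temp file...
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"version": version, "files": data_files}, f)
    # ...then publish it with an atomic rename: readers see the whole
    # transaction or none of it, never a half-written commit.
    os.replace(tmp, os.path.join(log_dir, f"{version:08d}.json"))

def latest_version(table_dir):
    """Readers resolve the table snapshot from the highest commit file."""
    log_dir = os.path.join(table_dir, "_log")
    commits = sorted(os.listdir(log_dir))  # zero-padded names sort in order
    with open(os.path.join(log_dir, commits[-1])) as f:
        return json.load(f)
```

The point of the sketch is that atomicity lives in the metadata layer, not in the data files themselves, which is how ACID semantics can be retrofitted onto cheap object storage.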
6. What’s happening?
[Figure: same quadrant chart]
• Some vendors argue that the separation of processing and storage has finally made it possible to bring all data into a single repository
• Some data lake and EDW vendors are luring customers with the promise of cheap storage and fast execution for all your data
7. Is that true?
• Can a data lake turned EDW (or vice versa) with data science capabilities be your ultimate data repository?
• Let’s analyze this in detail:
  1. Is it technically feasible?
  2. Is it realistic to operate such a system?
  3. Is it cost efficient?
8. Is it technically feasible?
• Currently, there is no single system that is best-of-breed for all kinds of data processing:
  • Schema-on-write EDW vendors are better for analytical queries, but lack flexibility for less structured data
  • Schema-on-read data lake vendors lag in performance compared with their EDW counterparts
  • Operational RDBMSs remain the best option for transactional data management
  • Specialized engines for time series, graphs, indexing, etc. provide additional flavors of data processing that are a best fit for certain scenarios
9. Is it realistic to operate such a system?
• For a single repository to work, it needs to contain every piece of data in the organization
  • This includes all internal and external systems of all kinds
• Ingestion pipelines need to be created, operated, and maintained for each of the sources
  • Some data may need near-real-time replication, requiring CDC-based replication that is more complex to operate and maintain
• Raw data within the system will need to be curated and transformed to adapt to the needs of the consumer, creating even more pipelines
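To make the operational cost concrete, here is a minimal sketch of one such incremental ingestion pipeline, using a high-water-mark on an `updated_at` column. All names are hypothetical; a real deployment would need one variant of this loop per source, plus error handling, schema-drift logic, and backfills, which is exactly the burden described above.

```python
def incremental_load(source_rows, target, watermark):
    """Copy only rows changed since the last run, keyed on updated_at."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row           # upsert into the replica
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark                      # persist for the next run

# Hypothetical source extract: one stale row, one changed row.
source = [
    {"id": 1, "updated_at": 100, "v": "a"},
    {"id": 2, "updated_at": 205, "v": "b"},
]
target = {}
wm = incremental_load(source, target, watermark=150)
print(wm, sorted(target))  # 205 [2]
```

Even this toy version carries state (the watermark) that must be stored durably and recovered on failure; multiply that by every source system and the operational picture the slide paints emerges.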
10. Is it cost efficient?
• Is replicating and curating all data cost-effective?
• What do you do with data that is seldom used?
• It’s a catch-22 situation:
  • If you don’t ingest it, you may be limiting the quality of your insights
  • If you replicate it, the ROI of the data lake will be questionable: why did you spend so much time and money ingesting data nobody has ever used?
• Costs for “expanded” workloads can skyrocket
  • E.g. data science with a pricing model designed for OLAP
12. The challenges of a distributed architecture
[Figure: same quadrant chart]
• Extending a single physical system to cover all the requirements of a modern data strategy doesn’t seem feasible
  • Need for collaboration instead of competition
• However, using multiple different systems creates a complexity that end users won’t be able to navigate, or that will slow them down
  • Loss of agility and quality of insights
• In addition, securing and governing a distributed landscape poses another challenge
• Can a logical and distributed architecture be the solution?
13. Modern Architectures: LDW and Data Fabric
▪ Distributed: data resides in multiple systems / locations
  ▪ Data today is too big and too distributed
  ▪ Modern analytics needs are too diverse: one size never fits all
  ▪ Hybrid and multi-cloud
▪ Logical: consumers access data through semantic models, decoupled from data location and physical schemas
  ▪ No need to deal with specific languages, formats, and protocols for each source
  ▪ Semantic models are business-friendly and enforce common policies
  ▪ Allows for technology evolution and infrastructure changes
    ▪ Cloud transition
    ▪ Flexibility to change integration methods (e.g. real-time to persisted)
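The logical idea can be shown in miniature: a thin virtual view answers a business question by combining data from two physical systems at query time, without replicating either. This is only a toy illustration of the concept under assumed names (`warehouse_customers`, `lake_clickstream`, `customer_360`), not Denodo's API or a real connector.

```python
def warehouse_customers():
    """Stands in for an EDW connector (hypothetical data)."""
    return [{"cust_id": 1, "region": "EMEA"}]

def lake_clickstream():
    """Stands in for a data lake connector (hypothetical data)."""
    return [{"cust_id": 1, "clicks": 42}]

def customer_360():
    """Semantic view: joins both sources without moving the data.
    Consumers see one model, not two physical schemas."""
    clicks = {r["cust_id"]: r["clicks"] for r in lake_clickstream()}
    return [dict(c, clicks=clicks.get(c["cust_id"], 0))
            for c in warehouse_customers()]

print(customer_360())  # [{'cust_id': 1, 'region': 'EMEA', 'clicks': 42}]
```

Because consumers depend only on `customer_360`, either backing system can be migrated or swapped (say, during a cloud transition) without touching the consumer side, which is the decoupling argument made above.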
Source: “The Practical Logical Data Warehouse,” Gartner, Dec 2020
Source: “Demystifying the Data Fabric,” Gartner, September 2020
14. Denodo: bridge the gap between infrastructure and business
1. Single access point to all data, at any location
2. Data is exposed using the formats and conventions most appropriate for each consumer, through a semantic layer
3. Access any data using any technology and style: SQL, APIs, streaming, …
4. Trusted data: enforce consistent semantics, data quality, governance, and security at data delivery
5. Discoverability: an active data catalog builds a data marketplace for the business
6. Active metadata generates intelligence for the whole data management infrastructure
15.
1. EDWs and data lakes are key pieces of any modern data strategy
2. However, extending those systems to centralize all analytical activities is problematic: loss of capabilities, harder to operate and govern, and skyrocketing costs
3. A logical approach that consolidates all systems to act as one is more realistic and cost-effective
4. Denodo has a proven track record in LDW and Data Fabric enterprise projects
16. Thank you for attending!
On-demand sessions and slides will be available soon