Data lakehouse fallacies

Copyright 2021 Hired Brains Research and Neil Raden
Data Lake/Lakehouse/Cloud Data Warehouse: Which is Real?
By Neil Raden, Founder Hired Brains Research
Data Lake:
Hadoop (and later, other cloud storage options) was indifferent to the size and type of
files it could process, unlike relational data warehouses, which were rigid and not
nearly as scalable. From it hatched the idea of a single place for everything: the data
lake. In truth, the concept was promoted by the Hadoop distributors to sell more
licenses. Though it did simplify searching for and locating files, it provided no
analytical processing tools at all. Moving a JSON file from Paris, France, to a cloud
location in Paris, Texas, adds no value except for some economies of scale in storage
and processing.
The data lake collects raw data: thousands, perhaps millions, of files. This is
posited as a benefit. But is it really? At a certain level, raw data is an
oxymoron. We can't triangulate data to see if it's consistent with other
instances of the same phenomenon or event. "Raw data" typically implies it is
to be used for a particular purpose, and it is the beginning point for drawing
inferences and conclusions. The context of data (why, how, and when it was
recorded, and by what method it was collected and then transformed) is
essential. Context-free data simply does not exist. The perfect objectivity we
assign to "raw data" is a myth. That's why, in data warehousing, we attempted
to integrate and rationalize things.
Industry analyst Andrew Brust, in "Big on Data," quotes George Fraser, CEO of
Fivetran:
"I think 2021 will reveal the need for data lakes in the modern data stack is
shrinking...there are no longer new technical reasons for adopting data lakes because
data warehouses that separate compute from storage have emerged." If that's not
categorical enough for you, Fraser sums things up thus: "In the world of the modern
data stack, data lakes are not the optimal solution. They are becoming legacy
technology."
For organizations that lack cloud-native data warehouses that separate compute from
storage or even lack a cloud strategy, that is something of an oversimplification. The
calculation of costs for hybrid-cloud, multi-cloud, and separation of storage from
compute borders on alchemy. And even a good approximation is only as good as the
moment you make it, because things change so quickly. There is one secret, though:
you will do worse without a model, no matter what approach you take.
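To make the point about needing a model concrete, here is a minimal sketch of one in Python. Everything in it is an assumption for illustration: the `Workload` fields, the two pricing schemes, and every price are made-up placeholders, not vendor figures. It only shows the shape such a model might take.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    stored_tb: float      # data kept at rest, in terabytes
    compute_hours: float  # query compute actually used per month


def coupled_cost(w: Workload, node_tb: float, node_price: float) -> float:
    """Monthly cost when storage and compute are bought together as nodes."""
    nodes = math.ceil(w.stored_tb / node_tb)  # enough nodes to hold the data
    return nodes * node_price


def separated_cost(w: Workload, tb_price: float, hour_price: float) -> float:
    """Monthly cost when storage and compute are billed independently."""
    return w.stored_tb * tb_price + w.compute_hours * hour_price


# All prices below are made-up placeholders, not vendor quotes.
light = Workload(stored_tb=100, compute_hours=200)
heavy = Workload(stored_tb=100, compute_hours=600)

for name, w in [("light", light), ("heavy", heavy)]:
    c = coupled_cost(w, node_tb=10, node_price=1000)
    s = separated_cost(w, tb_price=25, hour_price=16)
    print(f"{name}: coupled={c} separated={s}")
```

With these placeholder numbers, the cheaper option flips once compute usage triples. That is why an approximation ages so quickly, and why it still beats having no model at all.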
Another thing to consider is that "organization" is often an oxymoron. While most
organizations may have a single nominal "strategy" for data architecture, the result of
acquisitions, legacies, geography, and just the usual punctuated progress is often
a collection of them, distributed physically and architecturally. The best advice is:
Pay more attention to what your data means than where you put it.
To patch some of the data lake idea's manifest deficiencies, cloud providers have
regularly added processing capabilities that mimic early data warehousing features –
comically calling it the "Data Lakehouse" (or the Databricks variant, the Delta Lake).
Data Lakehouse:
According to Databricks, "A data lakehouse is a new, open data management paradigm
that combines the capabilities of data lakes and data warehouses, enabling BI and ML
on all data. ... Merging them into a single system means that data teams can move
faster as they can use data without accessing multiple systems." This statement is more
aspiration than fact. Data warehouses represent forty years of continuous (though not
always smooth) progress and provide all of the services that are needed, such as:
• AI-driven query optimizer
• Complex query formation
• Massively parallel operation based on the model, not just sharding
• Workload Management
• Load Balancing
• Scaling to thousands of simultaneous queries
• Full ANSI SQL and beyond
• In-database Advanced Analytics and support for ML
• Ability to handle native data types such as spatial and time-series
The fact is that some data warehouse platforms do perform all of these functions and
more and are very central to the operations of businesses.
In the early seventies, the world was beset with an energy crisis. Some executives in
Detroit decided that the US needed small cars, with which they had little experience,
but they came up with a platform anyway. But Americans loved their pickup trucks,
which accounted for a substantial share of the automakers' revenue, Ford and Chevy
especially. When you have a terrible solution, the worst thing you can do is pile on
more terrible decisions: witness the 1973 Ford Courier mini pickup truck, one of the
most poorly designed, ill-conceived vehicles in history.

If you can query a JSON file in the Data Lakehouse with SQL transparently, you have
accomplished something. But not enough. What troubles me the most is that the data
lakehouse's justification is that it's a data lake with some analytical capabilities,
and those capabilities are mostly inherited from the expanding capabilities of cloud
services themselves. What I haven't heard mentioned are understandability and usability.
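A small sketch shows why querying JSON with SQL is something, but not enough. It uses Python's standard `json` and `sqlite3` modules as a stand-in for a lakehouse query engine. The records, the `events` table, and its column names are hypothetical, invented purely for this example.

```python
import json
import sqlite3

# Hypothetical raw JSON, standing in for a file sitting in a data lake.
raw = """[
  {"city": "Paris", "country": "FR", "orders": 12},
  {"city": "Paris", "country": "US", "orders": 3},
  {"city": "Lyon",  "country": "FR", "orders": 5}
]"""

records = json.loads(raw)

# Load the records into an in-memory table so they can be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT, country TEXT, orders INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (:city, :country, :orders)", records
)

# The query runs and returns totals per country.
rows = conn.execute(
    "SELECT country, SUM(orders) FROM events GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```

The query succeeds, yet nothing in the file or the SQL says what "orders" means, when it was recorded, or whether the two "Paris" rows describe comparable things. That gap is the missing understandability and usability.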
Cloud Data Warehouse:
There are principally three cloud data warehouses: AWS Redshift, Snowflake, and
Google BigQuery. Many other relational data warehouse technologies have acceptable
cloud versions, but the cloud-natives claim the high ground for now. At a certain
maturity, they provide all of the functions listed above natively, rather than as bolt-on
capabilities to generic cloud features. However, it does get a little blurry because the
CDWs provide more than a traditional data warehouse. One, for example, provides a
public data exchange market. I've noticed the word "warehouse" starting to disappear
from their content.
Would you rather have a cloud-native data warehouse that can handle the most
challenging data warehouse tasks but can also provide most of the functionality of a
data lake (or, to put it another way, to eliminate the need for a data lake), or would
you prefer a data lake with partial data warehouse capabilities slapped on?
To sum up:
1. The concept of a data lake is flawed. In an age of multi-cloud and hybrid-cloud
distributed data, not to mention sprawling sensor farms of IoT, there is no
advantage to pulling it all together. AI-driven knowledge graphs are a far better
alternative: they locate and tag data where it is.
2. If you dismiss the data lake, you must of necessity dismiss the lakehouse.
3. Pay more attention to what your data means than where you put it.
A data lake looks to me to be static "dumb" data neatly arranged. A data lakehouse, if
you must use that term, is fundamentally different from a data warehouse. It is a
comprehensive set of capabilities that provides a graph-based, linked, and
contextualized information fabric (semantic metadata and linked datasets) with NLP
(Natural Language Processing), Sentiment Analysis, Rules Engines, Connectors, and
Canonical Models for common domains. Add to that cognitive tools that can be
plugged in to turn "dumb" data into information assets with speed, agility, reuse, and
value. I haven't seen one yet.
Neil Raden founded Hired Brains Research in 1985 to provide thought leadership, context, and advisory
consulting and implementation services in Data Architecture, AI, Analytics/Data Science, and organizational
change for analytics for clients worldwide across many industries. Neil is a recognized authority on AI Ethics,
the co-author of the first book on Decision Management, "Smart (Enough) Systems," and the foundational
report for the Society of Actuaries, "Ethical Use of Artificial Intelligence for Actuaries." He welcomes your
comments at nraden@hiredbrains.com
