Data Mesh: What is it, who is it for, and who is it definitely not for?
What are its foundational principles and how could we apply some of them to our current Data Analytical Architectures?
3. @rwerschkull
nl.linkedin.com/in/rogierwerschkull
3
No, it won’t ‘save us’….
There is NO quick fix to become ‘Data Driven’
or ‘information supported’
or whatever you want to call what we are doing here…
But there are Data Mesh aspects that make sense!
…if you ignore most of the (tech) vendor washing…
As in…
About me...
Rogier Werschkull
21 years ‘in field’ of Data, Data warehousing and
Business Intelligence
Data architecture advice, data modeling, data
engineering, data-analytics product owner
Blogger, trainer, conference speaker
Contact details:
www.linkedin.com/in/rogierwerschkull/
rogier@rogerdata.nl
@rwerschkull
“Every single company I've worked at and talked to has the
same problem without a single exception so far:
poor data quality...
Either there's incomplete data, missing data,
duplicative data.”
Ruslan Belkin, former VP of Engineering @ Twitter and Salesforce
Data mesh tries (again?) to combat the ‘Analytics Misery’
we tend to create…
HOW?
By decentralizing most ‘data warehousing concerns’ to individual
business domains. As in:
1. In either the operational source system
2. Or in a decentralized DWH team that sits ‘closer’ to the source systems
By ‘calling out’ the required organizational / cultural change to
accomplish this…
What is data Mesh-1?
Data mesh is described in the official Data Mesh book by
Zhamak Dehghani (Thoughtworks) as follows:
‘Data mesh is a decentralized sociotechnical approach to share,
access, and manage analytical data in complex and large-scale
environments—within or across organizations’
What is data Mesh-2?
Only for large organizations, as in:
Having a lot of source systems
A lot of employees
Where there are clear / separated business domains
Only for large organizations that…
are not afraid to experiment
• That can live with the current absence of viable implementation patterns
have a mature (centralized?) data / analytics department
preferably can influence the design / development of the operational
applications they use
For WHO and WHEN?
Data mesh is foremost about
people and processes
About changing the data-analytics
‘culture’
About the process of
collaborating on…
creating decentralized, valuable
data products within a ‘business
domain’
and sharing data between
these domains
NOT about technology!
People
Process
Data
Technology
It’s not that I don’t agree with the core data mesh principles
It’s the way they are explained: ‘quite academic’
Implementation guidelines are still missing
• Which Zhamak also clearly acknowledges: it is still an emerging concept!
But then still, some vital context is truly missing
A major issue I have with the book (that we need to counter)
I really have doubts about the amount of data integration experience the
contributors have, because:
• it states that folks building DWHs are still striving for the ‘single version of the truth’
(10 years ago, yes…)
• there is no mention at all of modern ELM-based data modeling patterns
(that are there to help data integration)
Quite often the book’s content does
not help in this aspect…
It’s about doing
this
analytical
work with
data
somehow
somewhere
Structure information so it can
be consumed easily. Shaped
for diverse types of users, use
cases and tools
Reliable, durable
integration / unification
of data
Register the history and
history of changes
to data
Store data you receive
once, protected from
ungoverned deletion
DWH
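The ‘store data once, protected from ungoverned deletion’ and ‘register the history of changes’ concerns can be sketched in a few lines. This is only a minimal in-memory illustration, not a reference implementation: the class `HistoricalStagingArea`, its methods and the sample customer record are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class StagedRecord:
    business_key: str
    payload: dict
    load_ts: datetime
    row_hash: str


class HistoricalStagingArea:
    """Hypothetical append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    @staticmethod
    def _hash(payload):
        # Canonical JSON so that key order does not influence the hash.
        canonical = json.dumps(payload, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def load(self, business_key, payload, load_ts=None):
        """Append a new version only if it differs from the latest stored one."""
        row_hash = self._hash(payload)
        latest = self.latest(business_key)
        if latest is not None and latest.row_hash == row_hash:
            return False  # nothing changed, nothing stored
        ts = load_ts or datetime.now(timezone.utc)
        self._rows.append(StagedRecord(business_key, dict(payload), ts, row_hash))
        return True

    def history(self, business_key):
        """All stored versions of a key, oldest first (time-variant)."""
        return [r for r in self._rows if r.business_key == business_key]

    def latest(self, business_key):
        versions = self.history(business_key)
        return versions[-1] if versions else None


hsa = HistoricalStagingArea()
hsa.load("customer-42", {"name": "Acme", "city": "Utrecht"})
hsa.load("customer-42", {"name": "Acme", "city": "Utrecht"})    # unchanged: skipped
hsa.load("customer-42", {"name": "Acme", "city": "Amsterdam"})  # changed: new version
```

The class deliberately has no update or delete method: that absence is what makes the store non-volatile, while the version list per key makes it time-variant.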
Common DWH patterns overview
Per data-analytics architecture: what is it, architecture solution features, most common development style, and how each data warehousing ‘concern’ (Subject Oriented, Integrated, Time-Variant, Non-Volatile) is covered
Data lake
• What is it? A repository of raw data of any type for analytics purposes
• Architecture solution features: Decentralized tech. On premise or cloud
• Most common development style: Decentralized ‘puddles of lake’
• Subject Oriented: No
• Integrated: No
• Time-Variant: Yes, in general implemented as a file store
• Non-Volatile: Yes, in general implemented as a file store
DWH 3.0 / ‘Lakehouse’
• What is it? The re-merger of data lake and ‘classical’ DWH concerns, also known as the ‘Modern DWH’
• Architecture solution features: Centralized tech. Cloud based
• Most common development style: Centralized or decentralized depending on business complexity
• Subject Oriented: Yes, via database transformation rules
• Integrated: Yes, via database integration rules
• Time-Variant: Yes, via a database Historical Staging Area
• Non-Volatile: Yes, via a database Historical Staging Area
Data Mesh
• What is it? Distributed data architecture that pushes down ‘DWH concerns’ to the source / ‘business domain’
• Architecture solution features: Highly decentralized tech. On premise or cloud
• Most common development style: Highly decentralized by definition. Focus on ‘data as a product’
• Subject Oriented: Yes, but mainly pushed to the ‘business domains’
• Integrated: Yes, local within the business domain and centralized via a ‘knowledge graph’-like ‘mapping’
• Time-Variant: Yes, but pushed to the ‘business domains’
• Non-Volatile: Yes, but pushed to the ‘business domains’
Data Fabric
• What is it? Distributed data architecture where ‘time variant / non volatile’ concerns are pushed down to the source systems
• Architecture solution features: Centralized tech. On premise or cloud. Sources decentralized
• Most common development style: Centralized or decentralized depending on business complexity
• Subject Oriented: Yes, via centralised virtual transformation rules
• Integrated: Yes, via centralised virtual integration logic
• Time-Variant: What the operational system provides or by creating a Historical Staging Area in an analytical database
• Non-Volatile: What the operational system provides or by creating a Historical Staging Area in an analytical database
1. Principle of domain ownership
Analytical data should be owned by either the source system or its main
consumers
2. Data as a product
Build data artifacts with a true product
(management) mindset
3. Self-service data platform
Use (shared) infrastructure as a platform (in the cloud?) to build this
4. Federated computational governance
A data governance operating model based on federated decision-making and
accountability
Based on these foundational principles…
Feasible
Can the product be made in an acceptable time & for acceptable costs?
Valuable
What are the desires of my customers?
What is my market?
• How to do marketing?
What is the USP?
What price is ‘justified’?
Are my customers happy?
Usable
Is the product being used?
Is the product easy (enough) to use?
Are my customers happy?
Some examples of work you’ll need to do
here!
To see ‘data / information as a product’, it practically needs to be:
Discoverable
An easy, Google-like way to find data sets
Addressable
The product needs to have a permanent unique identifier that stays stable over time
Understandable
The product needs to be accompanied by metadata that describes WHAT something is
Trustworthy and ‘truthful’
The product needs to have a lot of data quality metrics and lineage metadata attached
Natively accessible
Accessible via any interface that suits the consumer, i.e. as an API / via ODBC-SQL / as a stream ‘topic’
Interoperable and composable
The product needs to be accompanied by metadata on HOW it can be combined with other products
Valuable on its own
Usable without the need to first combine it with other data products
Secure
Data security / privacy needs to work on the product without needing ‘something else’
Data Mesh Data Product principles
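To make these principles a bit more tangible, here is a minimal sketch of a metadata descriptor that a data product could carry around. It is an assumption made for illustration only: the class `DataProductDescriptor`, its field names and the sample values do not come from the book or from any existing tool. ‘Valuable on its own’ is left out on purpose, because it cannot be derived from metadata alone.

```python
from dataclasses import dataclass, field


@dataclass
class DataProductDescriptor:
    """Hypothetical metadata record; one field (group) per product principle."""
    product_id: str                                         # Addressable
    description: str                                        # Understandable
    owner_domain: str                                       # owning business domain
    search_tags: list = field(default_factory=list)         # Discoverable
    access_interfaces: list = field(default_factory=list)   # Natively accessible
    quality_metrics: dict = field(default_factory=dict)     # Trustworthy
    lineage: list = field(default_factory=list)             # Trustworthy (upstream sources)
    join_keys: dict = field(default_factory=dict)           # Interoperable and composable
    access_policies: list = field(default_factory=list)     # Secure

    def missing_principles(self):
        """Return the principles for which no metadata has been registered yet."""
        checks = {
            "Discoverable": bool(self.search_tags),
            "Addressable": bool(self.product_id),
            "Understandable": bool(self.description),
            "Trustworthy": bool(self.quality_metrics) and bool(self.lineage),
            "Natively accessible": bool(self.access_interfaces),
            "Interoperable and composable": bool(self.join_keys),
            "Secure": bool(self.access_policies),
        }
        return [name for name, ok in checks.items() if not ok]


orders = DataProductDescriptor(
    product_id="sales.orders.v1",
    description="One row per confirmed customer order",
    owner_domain="sales",
    search_tags=["orders", "sales"],
    access_interfaces=["sql", "rest-api"],
    quality_metrics={"completeness_pct": 99.2},
    lineage=["sales_crm.order_header"],
)
print(orders.missing_principles())
# → ['Interoperable and composable', 'Secure']
```

A check like `missing_principles()` could be run before a product is published to a catalog, so gaps show up as a concrete to-do list instead of a discussion.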
…an extension / repackaging of the existing FAIR data
principles
https://www.go-fair.org/fair-principles/
This is not completely new…
FAIR principle: corresponding DATA MESH product principles
Findable: Discoverable, Understandable
Accessible: Addressable, Natively accessible, Secure
Interoperable: Trustworthy and ‘truthful’, Interoperable and composable, Valuable on its own
Reusable: Natively accessible, Trustworthy and ‘truthful’, Understandable
‘Pushing down’ DWH concerns to the operational systems
will likely be a long journey
In addition, a lot of the tech mentioned in the book to cover
some vital aspects of the data mesh does not exist yet
The alternative I see…
Use the DWH 3.0 / Lakehouse pattern
Make sure to cover the mentioned Data Mesh principles there
• I think the key there is to use Data Vault or another ELM-based data
modeling style as an enabler
Overall, this would be my starting point
when MVP-ing a ‘meshy architecture’
Subject Oriented
Create domain specific and centralized hubs
Create domain specific satellites
Integrated
Domain specific hubs should (obviously) be integrated within a domain
Centralized hubs should be fed from all domains and / or a central MDM
source
Centralized ‘same-as’ links should be created to enable integrating across
domains
• Domain specific satellites can then be shared too
Time variant & non volatile
Before the Subject Oriented / Integrated step, data should be loaded RAW in
a Historical Staging Area first
Implementing the ‘DWH concerns’ using
Data Vault
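A rough sketch of how domain-specific hubs, a centralized hub and a ‘same-as’ link could fit together. Everything here is an assumption made for illustration: the class names, the hub names, the sample business keys, and the choice of an MD5 hash over a trimmed, upper-cased business key (a common Data Vault convention, but not something this deck prescribes).

```python
import hashlib


def hub_hash_key(*business_key_parts):
    """Derive a deterministic hub key from one or more business key parts."""
    normalized = "||".join(str(part).strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class Hub:
    """Hypothetical hub: one row per unique business key, never overwritten."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # hash key -> business key + record source

    def add(self, business_key, record_source):
        hash_key = hub_hash_key(business_key)
        # setdefault keeps the first registration, so reloads are idempotent.
        self.rows.setdefault(
            hash_key, {"business_key": business_key, "record_source": record_source}
        )
        return hash_key


class SameAsLink:
    """Hypothetical same-as link: maps domain hub keys onto a centralized hub."""

    def __init__(self, central_hub):
        self.central_hub = central_hub
        self.rows = {}  # (domain hub name, domain hash key) -> central hash key

    def relate(self, domain_hub, domain_hash_key, central_hash_key):
        if domain_hash_key not in domain_hub.rows:
            raise KeyError("unknown key in domain hub " + domain_hub.name)
        if central_hash_key not in self.central_hub.rows:
            raise KeyError("unknown key in central hub " + self.central_hub.name)
        self.rows[(domain_hub.name, domain_hash_key)] = central_hash_key

    def resolve(self, domain_hub, domain_hash_key):
        return self.rows.get((domain_hub.name, domain_hash_key))


# Two domains know the same customer under different business keys.
sales_hub = Hub("hub_customer_sales")
service_hub = Hub("hub_customer_service")
central_hub = Hub("hub_customer")  # fed from all domains and / or a central MDM source

sales_hk = sales_hub.add("S-1001", "sales_crm")
service_hk = service_hub.add("TICKET-USER-77", "service_desk")
central_hk = central_hub.add("CUST-42", "mdm")

link = SameAsLink(central_hub)
link.relate(sales_hub, sales_hk, central_hk)
link.relate(service_hub, service_hk, central_hk)
```

Once both domain keys resolve to the same central key, the satellites hanging off either domain hub can be combined across domains without either domain changing its own keys.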
Implementing ‘data as a product’ in
the DWH 3.0 / Lakehouse pattern
Per product principle: my first implementation ‘idea’ / suggestion
Discoverable
• Data product metadata should be pushed or pulled from all domains towards a data catalog with a good search interface, where NO MANUAL CURATION is needed!
Addressable
• If source entity names change, the data product’s entity names should remain stable. That’s also the purpose of a GOOD data vault hub
Trustworthy and ‘truthful’
• Data quality tests should be part of each data product; not having them should block releasing it
• The data catalog mentioned under Discoverable should handle lineage
Natively accessible
• Next to storing data products ‘in a database’ and using ODBC / JDBC to access them, create a data API on top of each data product or make sure it can be used via an API too
• Database systems with ‘low friction’ data sharing capabilities could help here
Interoperable and composable
• Embed metadata from parent / child data products in the data product itself
• Again, a data catalog plays a central role here
Valuable on its own
• ELM-based data modeling patterns (like Data Vault) and a datamart modeling style like Kimball / Dimensional are still the way to go
Secure
• Use the native row and column level security features of modern cloud-based analytical databases.
• Register these policies as metadata.
• This requires data product consumers to consume using named accounts only
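The ‘Trustworthy’ suggestion above (no data quality tests, no release) could look roughly like this as a release gate. It is a sketch under assumptions: `release`, `ReleaseBlocked` and the two sample tests are hypothetical names, and a real setup would more likely run such checks inside a CI/CD pipeline or a testing framework.

```python
class ReleaseBlocked(Exception):
    """Raised when a data product must not be released."""


def release(product_name, rows, tests):
    """Run all data quality tests; block the release if any fail or none exist."""
    if not tests:
        raise ReleaseBlocked(product_name + ": no data quality tests defined")
    results = {name: bool(test(rows)) for name, test in tests.items()}
    failed = [name for name, ok in results.items() if not ok]
    if failed:
        raise ReleaseBlocked(product_name + ": failed tests: " + ", ".join(failed))
    return results  # could be registered in the data catalog as quality metadata


def order_id_is_unique(rows):
    ids = [row["order_id"] for row in rows]
    return len(ids) == len(set(ids))


def amount_is_not_null(rows):
    return all(row.get("amount") is not None for row in rows)


tests = {"order_id_is_unique": order_id_is_unique,
         "amount_is_not_null": amount_is_not_null}

good_rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.5}]
bad_rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 1, "amount": None}]

release_results = release("sales.orders.v1", good_rows, tests)

try:
    release("sales.orders.v1", bad_rows, tests)
    blocked_message = None
except ReleaseBlocked as exc:
    blocked_message = str(exc)
```

Treating ‘no tests defined’ as a blocking condition is the important design choice here: it makes data quality a precondition for release rather than an afterthought.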
All relevant data mesh info is collected at
https://datameshlearning.com/user-stories/
Scott Hirleman drives this initiative
He also hosts the accompanying Data Mesh Radio podcast
• Listen to the episode with Shane Gibson (Knowledge Gap presenter) here:
https://daappod.com/data-mesh-radio/repeatable-patterns-and-data-mesh-shane-gibson/
Check out the user journeys here:
https://datameshlearning.com/user-stories/
LAST: Where can I find more info?
Our industry is still immature:
https://www.linkedin.com/pulse/note-piergiuseppe-bill-inmon/
Summary:
-One of the signs of immaturity of our industry is the practice of depending on vendors to lead the industry.
-Because we are a young and immature industry, there are new advancements that occur every day.
-Because of the newness of our industry there are very few principles. There are new toys. There are new gadgets.
-When a new product or technology comes into the marketplace, the vendor thinks that it is their duty to remove everything that has come before.
-There is a secret to combatting the vendors who are telling you that you are dated and old. The secret is to deliver business value to your end user.
What does this number mean?
Yes, it is the failure rate of BI, analytics and classical data warehousing initiatives
But also of Big Data, data lake, IoT or AI projects. It is the amount of ML work that never sees the light of day in your production environment
And I am not making this up: it is being said again and again by the likes of Gartner, Forrester, CIO.com and Cisco
There are 2 reasons
My take on the primary reason WHY this lasting failure is still happening: we are still not addressing the data quality problem structurally
Quality is still an afterthought
It is not only Bill Inmon who says this; this is a guy working at Salesforce, a modern cloud-based SaaS company.
In my opinion, solving / addressing / governing these data quality issues is implicitly the core of what data warehousing methodology should address
IMHO: it is ‘just’ a NEW form of Decentralized Data Warehousing
Is data mesh an architecture?
Is it a list of principles?
Is it an operating model?
After all, we rely on the classification of patterns as a major cognitive function to understand the structure of our world.
Hence, I have decided to classify data mesh as a sociotechnical paradigm: an approach that recognizes the interactions between people and the technical architecture and solutions in complex organizations
If people had read the data mesh book, they would know that even Zhamak herself writes that it is still an emerging concept, and that a lot of the tech needed to build what she describes conceptually DOES NOT EVEN EXIST (YET).
As such, no one can actually claim to 'sell' a data mesh or claim to have built a 'full-fledged' one.
The only claim that could be true is:
that people are 'on a journey’ towards creating a data mesh
that vendors sell a tech component that might be applied when designing / building a data mesh.
That is quite a lot of people that will need to be protected, so read up!
Data mesh calls for a fundamental shift in the assumptions, architecture, technical solutions, and social structure of our organizations, in how we manage, use, and own analytical data:
Organizationally, it shifts from centralized ownership of data by specialists who run the data platform technologies to a decentralized data ownership model pushing ownership and accountability of the data back to the business domains where data is produced from or is used.
Architecturally, it shifts from collecting data in monolithic warehouses and lakes to connecting data through a distributed mesh of data products accessed through standardized protocols.
Technologically, it shifts from technology solutions that treat data as a byproduct of running pipeline code to solutions that treat data and code that maintains it as one lively autonomous unit.
Operationally, it shifts data governance from a top-down centralized operational model with human interventions to a federated model with computational policies embedded in the nodes on the mesh.
Principally, it shifts our value system from data as an asset to be collected to data as a product to serve and delight the data users (internal and external to the organization).
Infrastructurally, it shifts from two sets of fragmented and point-to-point integrated infrastructure services—one for data and analytics and the other for applications and operational systems to a well-integrated set of infrastructure for both operational and data systems.
Data warehousing is an activity, supported by a methodology. It has nothing to do with technology directly; it’s about addressing these data-analytical concerns
These four words say NOTHING about technology. Zip. Nada. They describe what functionally needs to happen.
In traditional DWH modeling approaches you still do this work ‘in one go’
But to be fair, that really is a problem:
What about
modelling time vs added value,
reverse engineering,
starting with a data first / data centric architecture?
Agility
https://www.gartner.com/smarterwithgartner/gartner-top-10-data-and-analytics-trends-for-2021/
https://www.slideshare.net/ParisDataEngineers/delta-lake-oss-create-reliable-and-performant-data-lake-by-quentin-ambard
Data Lakehouse: https://www.snowflake.com/guides/what-data-lakehouse
https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
https://medium.com/snowflake/selling-the-data-lakehouse-a9f25f67c906
Delta lake:https://docs.databricks.com/delta/index.html
Data Mesh:
https://francois-nguyen.blog/2021/03/07/towards-a-data-mesh-part-1-data-domains-and-teams-topologies/
https://martinfowler.com/articles/data-monolith-to-mesh.html
https://dpgmedia-engineering.medium.com/ddd-data-area-at-dpg-media-f0130e4d9766
If something is feasible, then you can do it without too much difficulty. When someone asks "Is it feasible?" the person is asking if you'll be able to get something done.
=Capable of being done with means at hand and circumstances as they are.
synonyms: executable, practicable, viable, workable
All this is work that (NOW) often does not happen and makes sense to think about, determine and measure!
The teams have the responsibility to provide data that is easily discoverable, understandable, accessible, and usable, known as data products. There are established roles such as data product owners in each cross-functional domain team that are responsible for data and sharing it successfully
This is missing from the book!
build own to companies:
Building a decentralized DWH in a database
Using centralized configurable cloud native infra (Snowflake, Bigquery, Databricks)
And therefore I don’t believe in cloud DWH as the answer that suddenly makes data warehousing successful
Who said this?