Data Mesh in Practice
How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake
Max Schultze - max.schultze@zalando.de - @mcs1408
Arif Wider - awider@thoughtworks.com - @arifwider
25-06-2020
Who are we?
Max Schultze
● Lead Data Engineer
● MSc in Computer Science
● Took part in early development of Apache Flink
● Retired semi-professional Magic: the Gathering player
Arif Wider
● Lead Technology Consultant
● Head of AI, ThoughtWorks Germany
● Scala & FP enthusiast
● Coffee geek
TABLE OF CONTENTS
Zalando Analytics Cloud Journey
What’s this Data Mesh?
Data Mesh in Practice
Zalando Analytics Cloud Journey
Legacy Analytics
DWH
Messaging Bus
Data Lake
Legacy Evolving
Zalando’s Data Lake
Web Tracking
Event Bus
DWH
Data Center
Ingestion
Storage
Serving
Metastore
Data Catalog
Fast Query Layer
Processing Platform
Centralization Challenges
Datasets provided by a data-agnostic infrastructure team
● Lack of ownership
Pipeline responsibility lies with a data-agnostic infrastructure team
● Lack of quality
Organizational scaling
● Central team becomes the bottleneck
A Recurring Pattern
Source domain product teams generating data
Data & ML engineers maintaining the data platform
Teams, decision makers, data scientists consuming data
Why is that?
central data platform
checkout service
checkout events
What is Data Mesh?
Old wine applied to new bottles…
→ Product Thinking
→ Domain-Driven Distributed Architecture
→ Infrastructure as a Platform
… creates value from Data
https://martinfowler.com/articles/data-monolith-to-mesh.html by Zhamak Dehghani
Data as a Product
Data Product
What is my market?
What are the desires of my customers?
What “price” is justified?
How to do marketing?
What’s the USP?
Are my customers happy?
Domain-Driven Distributed Architecture… applied to Data
Domain
Aggregated Domain
→ The Data Product is the fundamental building block
● Discoverable
● Addressable
● Self-describing
● Trustworthy
● Interoperable (governed by open standard)
● Secure (governed by global access control)
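The data product properties listed on this slide can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `DataProduct` and `Catalog` classes, every field name, the example bucket address, and the role names are hypothetical and do not come from the talk or from Zalando's platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one data product; all field names are illustrative."""
    domain: str                      # owning domain team
    name: str                        # product name within the domain
    address: str                     # stable, unique location (addressable)
    schema: dict                     # field name -> type name (self-describing)
    owner_contact: str               # who answers for quality (trustworthy)
    data_format: str = "parquet"     # an agreed open standard (interoperable)
    allowed_roles: frozenset = frozenset()  # input to global access control (secure)


class Catalog:
    """Toy in-memory catalog that makes registered products discoverable."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        key = (product.domain, product.name)
        if key in self._products:
            raise ValueError(f"{key} is already registered")
        self._products[key] = product

    def search(self, term):
        """Find products whose name or domain contains the search term."""
        term = term.lower()
        return [p for p in self._products.values()
                if term in p.name.lower() or term in p.domain.lower()]

    def can_read(self, product, role):
        """Access check against the roles declared on the product."""
        return role in product.allowed_roles


catalog = Catalog()
checkout_events = DataProduct(
    domain="checkout",
    name="checkout_events",
    address="s3://example-checkout-bucket/checkout_events/",  # made-up address
    schema={"order_id": "string", "amount_cents": "long"},
    owner_contact="checkout-team@example.com",
    allowed_roles=frozenset({"analyst"}),
)
catalog.register(checkout_events)
```

In this sketch the owning domain team fills in the descriptor, while discovery and the access check live in one shared catalog.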
...backed by domain-agnostic self-service data infrastructure
Data Infra as a Platform
Storage, pipeline, catalogue, access control, etc.
Data infra engineers
It’s a mindset shift
FROM | TO
Centralized ownership | Decentralized ownership
Pipelines as first-class concern | Domain Data as first-class concern
Data as a by-product | Data as a Product
Siloed Data Engineering Team | Cross-functional Domain-Data Teams
Centralized Data Lake / Warehouse | Ecosystem of Data Products
Data Mesh in Practice
Recap:
● From Bottleneck to Infra Platform
● From Data Monolith to Interoperable Services
Data Infra as a Platform
Storage, pipeline, catalogue, access control, etc.
central data platform
Central Services with Global Interoperability
Data Lake Storage
Metadata Layer
Bring Your Own Bucket (BYOB)
Data Lake Storage
Metadata Layer
Central Processing Platform
Data Lake Storage
Processing Platform
Metadata Layer
Simplify Data Sharing
Data Lake Storage
Processing Platform
Metadata Layer
Central Services with Global Interoperability
Decentralized ownership does not imply decentralized infrastructure!
Interoperability is created through the convenient solutions of a self-service platform.
Decentralized storage, central infrastructure
Decentralized ownership, central governance
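One way to picture decentralized storage with central governance is a shared metadata layer that records which team-owned bucket holds which dataset and who may read it. The following is a minimal sketch under assumptions: the `MetadataLayer` class, its method names, and the example bucket URI are all hypothetical, and nothing here reflects the actual services used at Zalando.

```python
class MetadataLayer:
    """Hypothetical central registry; the data itself stays in team-owned buckets."""

    def __init__(self):
        self._datasets = {}   # dataset name -> (owning team, bucket URI)
        self._grants = {}     # dataset name -> set of teams allowed to read

    def register_bucket(self, team, dataset, bucket_uri):
        """A team brings its own bucket and announces a dataset stored in it."""
        self._datasets[dataset] = (team, bucket_uri)
        self._grants[dataset] = {team}   # owners can always read their own data

    def grant_access(self, owner, dataset, consumer):
        """Only the owning team may share; the rule is enforced in one central place."""
        if self._datasets[dataset][0] != owner:
            raise PermissionError(f"{owner} does not own {dataset}")
        self._grants[dataset].add(consumer)

    def resolve(self, consumer, dataset):
        """Return the bucket URI if the consumer is allowed to read the dataset."""
        if consumer not in self._grants.get(dataset, set()):
            raise PermissionError(f"{consumer} may not read {dataset}")
        return self._datasets[dataset][1]


layer = MetadataLayer()
layer.register_bucket("checkout", "checkout_events",
                      "s3://example-checkout-bucket/events/")  # made-up URI
layer.grant_access("checkout", "checkout_events", "pricing")
uri = layer.resolve("pricing", "checkout_events")
```

The owning team decides who gets access, yet every consumer goes through the same lookup, which is where the interoperability comes from in this sketch.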
Recap:
● Datasets provided through pipelines of data-agnostic infrastructure teams
How to Ensure Data Quality?
Make conscious decisions
● Opt-in instead of default storage
● Classification of data usage
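The two decisions above, opting in to storage and classifying usage, could be expressed as explicit configuration. This is a sketch under assumptions: the usage classes, the event type names, and the function names are invented for illustration, since the slides do not name concrete classes or mechanisms.

```python
from enum import Enum


class UsageClass(Enum):
    """Hypothetical usage tiers; the talk does not name concrete classes."""
    EXPLORATION = "exploration"
    REPORTING = "reporting"
    BUSINESS_CRITICAL = "business_critical"


# Explicit opt-in list: event type -> declared usage class.
# Anything missing from this mapping is NOT archived.
ARCHIVE_OPT_INS = {
    "checkout.order_placed": UsageClass.BUSINESS_CRITICAL,
    "search.query_issued": UsageClass.EXPLORATION,
}


def should_archive(event_type):
    """Store only what a producer consciously opted in."""
    return event_type in ARCHIVE_OPT_INS


def usage_class(event_type):
    """Return the declared usage class, or None if the event type is not opted in."""
    return ARCHIVE_OPT_INS.get(event_type)
```

Because nothing is archived by default, every stored dataset has a producer who chose to store it and stated what it is used for.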
Data Quality - A Contract between Consumer and Producer
Behavioral changes for data producers
● Data is a product, not a by-product
● Dedicate resources to
○ Understand usage
○ Ensure quality
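A contract between producer and consumer can be checked automatically before data is published. The sketch below is one possible shape, not the approach described in the talk: `DataContract`, `check_contract`, the field names, and the null-fraction threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract agreed between a producer and its consumers."""
    required_fields: dict      # field name -> expected Python type
    max_null_fraction: float   # tolerated share of missing values per field


def check_contract(records, contract):
    """Return a list of readable violations; an empty list means the batch passes."""
    violations = []
    total = len(records)
    for name, expected_type in contract.required_fields.items():
        nulls = 0
        for record in records:
            value = record.get(name)
            if value is None:
                nulls += 1
            elif not isinstance(value, expected_type):
                violations.append(
                    f"{name}: expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
                break  # one type violation per field is enough to report
        if total and nulls / total > contract.max_null_fraction:
            violations.append(f"{name}: {nulls}/{total} values missing")
    return violations


contract = DataContract({"order_id": str, "amount_cents": int}, max_null_fraction=0.0)
good_batch = [{"order_id": "a1", "amount_cents": 1999}]
bad_batch = [{"order_id": "a2", "amount_cents": None}]
```

A producing team could run such a check in its own pipeline, which keeps responsibility for quality with the team that understands the data.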