Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake

The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.

  1. Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake. Max Schultze (max.schultze@zalando.de, @mcs1408) and Arif Wider (awider@thoughtworks.com, @arifwider), 25-06-2020.
  2. Who are we? Max Schultze: Lead Data Engineer; MSc in Computer Science; took part in early development of Apache Flink; retired semi-professional Magic: the Gathering player. Arif Wider: Lead Technology Consultant; Head of AI, ThoughtWorks Germany; Scala & FP enthusiast; coffee geek.
  3. Table of contents: Zalando Analytics Cloud Journey; What’s this Data Mesh?; Data Mesh in Practice.
  4. Zalando Analytics Cloud Journey
  5. Legacy Analytics DWH
  6. Diagram labels: Messaging Bus, Data Lake, Legacy, Evolving.
  7. Zalando’s Data Lake. Stages: Ingestion, Storage, Serving.
  8. Zalando’s Data Lake. Stages: Ingestion, Storage, Serving. Other diagram labels: Web Tracking, Event Bus, DWH, Data Center.
  9. Zalando’s Data Lake. Stages: Ingestion, Storage, Serving. Other diagram labels: Web Tracking, Event Bus, DWH, Data Center, Metastore.
  10. Zalando’s Data Lake. Stages: Ingestion, Storage, Serving. Other diagram labels: Data Catalog, Web Tracking, Event Bus, DWH, Data Center, Metastore, Fast Query Layer, Processing Platform.
  11. Centralization Challenges. Datasets provided by data-agnostic infrastructure team: lack of ownership.
  12. Centralization Challenges. Datasets provided by data-agnostic infrastructure team: lack of ownership. Pipeline responsibility on data-agnostic infrastructure team: lack of quality. (Illustration labels: Field_A, Field_B, Record_1, Record_2, Record_3.)
  13. Centralization Challenges. Datasets provided by data-agnostic infrastructure team: lack of ownership. Pipeline responsibility on data-agnostic infrastructure team: lack of quality. Organizational scaling: central team becomes the bottleneck.
  14. A Recurring Pattern. Source domain product teams generating data; data & ML engineers maintaining the data platform; teams, decision makers, and data scientists consuming data.
  18. Why is that? (Diagram label: central data platform.)
  19. Why is that? (Diagram labels: checkout service, checkout events.)
  20. What is Data Mesh? Old wine applied to new bottles (Product Thinking, Domain-Driven Distributed Architecture, Infrastructure as a Platform) creates value from data. Source: https://martinfowler.com/articles/data-monolith-to-mesh.html by Zhamak Dehghani
  21. Data as a Product. Questions for a Data Product: What is my market? What are the desires of my customers? What “price” is justified? How to do marketing? What’s the USP? Are my customers happy?
  22. Domain-Driven Distributed Architecture, applied to data. (Diagram label: Domain.)
  23. Domain-Driven Distributed Architecture, applied to data. The Data Product is the fundamental building block. (Diagram labels: Domain, Aggregated Domain.)
  24. Domain-Driven Distributed Architecture, applied to data. The Data Product is the fundamental building block. A Data Product is: discoverable; addressable; self-describing; trustworthy; interoperable (governed by open standard); secure (governed by global access control). (Diagram labels: Domain, Aggregated Domain.)
  25. Domain-Driven Distributed Architecture, applied to data, backed by domain-agnostic self-service data infrastructure. Data Infra as a Platform: storage, pipeline, catalogue, access control, etc., run by data infra engineers. The Data Product is the fundamental building block: discoverable; addressable; self-describing; trustworthy; interoperable (governed by open standard); secure (governed by global access control). (Diagram labels: Domain, Aggregated Domain.)
  26. It’s a mindset shift, FROM → TO: Centralized ownership → Decentralized ownership; Pipelines as first class concern → Domain Data as first class concern; Data as a by-product → Data as a Product; Siloed Data Engineering Team → Cross-functional Domain-Data Teams; Centralized Data Lake / Warehouse → Ecosystem of Data Products.
  27. Data Mesh in Practice
  28. Data Mesh in Practice. Recap: from bottleneck to infra platform. Data Infra as a Platform: storage, pipeline, catalogue, access control, etc.
  29. Data Mesh in Practice. Recap: from bottleneck to infra platform; from data monolith to interoperable services. Data Infra as a Platform: storage, pipeline, catalogue, access control, etc. (Diagram label: central data platform.)
  30. Central Services with Global Interoperability. (Diagram labels: Data Lake Storage, Metadata Layer.)
  31. Bring Your Own Bucket (BYOB). (Diagram labels: Data Lake Storage, Metadata Layer.)
  32. Central Processing Platform. (Diagram labels: Data Lake Storage, Processing Platform, Metadata Layer.)
  33. Simplify Data Sharing. (Diagram labels: Data Lake Storage, Processing Platform, Metadata Layer.)
  34. Central Services with Global Interoperability. Decentralized ownership does not imply decentralized infrastructure! Interoperability is created through convenient solutions of a self-service platform. Decentralized storage, central infrastructure; decentralized ownership, central governance.
  35. Data Mesh in Practice. Recap: datasets provided through pipelines of data-agnostic infrastructure teams.
  36. How to Ensure Data Quality? Make conscious decisions: opt-in instead of default storage.
  37. How to Ensure Data Quality? Make conscious decisions: opt-in instead of default storage; classification of data usage.
  38. Data Quality: A Contract between Consumer and Producer. Behavioral changes for data producers: data is a product, not a by-product.
  39. Data Quality: A Contract between Consumer and Producer. Behavioral changes for data producers: data is a product, not a by-product; dedicate resources to understand usage and ensure quality.
  40. Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake. Max Schultze (max.schultze@zalando.de, @mcs1408), Arif Wider (awider@thoughtworks.com, @arifwider).
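
The slides describe data quality as a contract between consumer and producer, with an owning team and a classification of data usage, but they show no implementation. The following is a minimal sketch of what such a contract could look like in code. It is not Zalando's implementation: the class `DataProductContract`, the function `violations`, the dataset name `checkout.events`, the field names, and the usage label `business-critical` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DataProductContract:
    """Hypothetical contract a producing team publishes for one dataset."""
    name: str                    # addressable: unique dataset name (assumed naming)
    owner_team: str              # explicit ownership by the producing domain team
    usage_class: str             # classification of data usage (assumed label)
    schema: dict[str, type]      # self-describing: field name -> expected type
    required_non_null: frozenset[str] = frozenset()


def violations(contract: DataProductContract, record: dict[str, Any]) -> list[str]:
    """Return human-readable contract violations for a single record."""
    problems = []
    for field_name, expected in contract.schema.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
            continue
        value = record[field_name]
        if value is None:
            # Nulls are only a violation for fields the producer promised to fill.
            if field_name in contract.required_non_null:
                problems.append(f"null not allowed: {field_name}")
            continue
        if not isinstance(value, expected):
            problems.append(
                f"wrong type for {field_name}: "
                f"expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Illustrative contract for the checkout events mentioned on slide 19.
checkout_events = DataProductContract(
    name="checkout.events",
    owner_team="checkout",
    usage_class="business-critical",
    schema={"order_id": str, "amount_cents": int},
    required_non_null=frozenset({"order_id"}),
)

good_record = {"order_id": "o-1", "amount_cents": 1299}
bad_record = {"order_id": None, "amount_cents": "12.99"}

print(violations(checkout_events, good_record))
print(violations(checkout_events, bad_record))
```

In this sketch the producing team owns both the contract and the check, so a consumer can see who is responsible for a dataset and what it promises. Whether such checks run at ingestion time or as a separate validation step is a design choice the slides leave open.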