Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold: a kind of “microservices” for the data-centric world. While the data mesh is not technology-specific as a pattern, the building of systems that adopt and implement data mesh principles has a relatively long history under different guises.
In this talk, we share our recommendations and picks of what every developer should know about building a streaming data mesh with Kafka. We introduce the four principles of the data mesh: domain-driven decentralization, data as a product, self-service data platform, and federated governance. We then cover topics such as the differences between working with event streams and working with centralized approaches, and highlight the key characteristics that make streams a great fit for implementing a mesh, such as their ability to capture both real-time and historical data. We’ll examine how to onboard data from existing systems into a mesh, how to model communication within the mesh, and how to deal with changes to your domain’s “public” data; we’ll also give examples of global standards for governance and discuss the importance of taking a product-centric view of data sources and the data sets they share.
1. Apache Kafka and the Data Mesh
Ben Stopford, Michael G. Noll
Office of the CTO, Confluent
Kafka Summit Americas, September 14-15, 2021
2. What is Data Mesh?
[Diagram: the data mesh draws on Data Marts, DDD, Microservices, and Event Streaming; domains such as Inventory, Orders, and Shipments each expose a Data Product into the mesh.]
3. The Principles of a Data Mesh
1. Data ownership by domain
2. Data as a product
3. Data available everywhere, self-serve
4. Data governed wherever it is
5. The Principles of a Data Mesh
1. Domain-driven Decentralization: local autonomy (organizational concerns)
2. Data as a First-class Product: product thinking, a “microservice for data”
3. Self-serve Data Platform: infra tooling, across domains
4. Federated Governance: interoperability, network effects (organizational concerns)
6. Principle 1: Domain-driven Decentralization
Objective: Ensure data is owned by those who truly understand it.
Pattern: Ownership of a data asset is given to the “local” team that is most familiar with it (decentralized rather than centralized data ownership).
Anti-pattern: Responsibility for data becomes the domain of the DWH team.
8. Data Mesh is about Connectivity
“Instead of collecting, you want to come up with a model that allows connectivity of the data.” (Zhamak Dehghani)
9. Practical example
1. Joe in Inventory has a problem with Order data.
2. Inventory items are going negative because of bad Order data.
3. He could fix the data up locally in the Inventory domain and get on with his job.
4. Or, better, he contacts Alice in Orders and gets it fixed at the source. This is more reliable, as Joe doesn’t fully understand the Orders process.
5. Ergo, Alice needs to be a responsible and responsive “Data Product Owner”, so everyone benefits from the fix to Joe’s problem.
[Diagram: Alice’s Orders domain publishes Order Data consumed by the Inventory, Billing, Recommendations, and Shipment domains.]
11. Recommendations: Domain-driven Decentralization
Learn from DDD:
• Use a standard language and nomenclature for data.
• Business users should understand a data flow diagram.
• The stream of events should create a shared narrative that is business-user comprehensible.
12. The Principles of a Data Mesh (recap)
13. Principle 2: Data as a First-Class Product
• Objective: Make shared data discoverable, addressable, trustworthy, and secure, so that other teams can make good use of it.
• Data is treated as a true product, not a by-product. This product thinking is important to prevent data chauvinism.
14. Data product, a “microservice for the data world”
• A data product is a node on the data mesh, situated within a domain.
• It produces—and possibly consumes—high-quality data within the mesh.
• It encapsulates all the elements required for its function, namely data + code + infrastructure:
  - Data: the data and metadata, including history.
  - Code: creates, manipulates, serves, etc. that data.
  - Infra: powers the data (e.g., storage) and the code (e.g., run, deploy, monitor).
Example: an “Items about to expire” data product, sketched below.
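To make this concrete, here is a minimal ksqlDB sketch of such a data product. The inventory_events stream, its fields, and the 7-day threshold are illustrative assumptions, not taken from the talk.

  -- Hypothetical input: the domain's internal inventory events.
  CREATE STREAM inventory_events (
    item_id VARCHAR KEY,
    item_name VARCHAR,
    expiry_ts BIGINT  -- expiry as epoch milliseconds
  ) WITH (KAFKA_TOPIC = 'inventory-events', VALUE_FORMAT = 'AVRO', PARTITIONS = 3);

  -- The data product's output port: a continuously maintained stream of items
  -- expiring within the next 7 days, published on its own topic for consumers.
  CREATE STREAM items_about_to_expire WITH (KAFKA_TOPIC = 'items-about-to-expire') AS
    SELECT item_id, item_name, expiry_ts
    FROM inventory_events
    WHERE expiry_ts < UNIX_TIMESTAMP() + 7 * 24 * 60 * 60 * 1000;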
16. ...naturally to Event Streaming with Kafka
[Diagram: domains and their data products, connected through event streams into a data mesh.]
The mesh is a logical view, not physical!
17. Event Streaming is Pub/Sub, not Point-to-Point
A data product writes (publishes) to a persisted stream; other data products read (consume) that stream, and other streams, independently.
Data producers are scalably decoupled from consumers, as the sketch below illustrates.
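As a hedged illustration (stream, topic, and field names are invented), two consuming domains can each derive their own view from the same published stream, without coordinating with the producer or with each other:

  -- The producing data product's published stream (its output data port).
  CREATE STREAM orders (order_id VARCHAR KEY, customer_id VARCHAR, total DOUBLE)
    WITH (KAFKA_TOPIC = 'orders', VALUE_FORMAT = 'AVRO', PARTITIONS = 3);

  -- Billing consumes the stream at its own pace...
  CREATE STREAM billing_input AS
    SELECT order_id, total FROM orders;

  -- ...while Recommendations independently consumes the very same stream.
  CREATE STREAM recommendations_input AS
    SELECT order_id, customer_id FROM orders;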
18. Why is Event Streaming a good fit for meshing?
• Streams are real-time, low latency ⇒ propagate data immediately.
• Streams are highly scalable ⇒ handle today’s massive data volumes.
• Streams are stored, replayable ⇒ capture real-time & historical data.
• Streams are immutable ⇒ auditable source of record.
19. How to get data into & out of a data product
Data flows through a data product’s input and output data ports in three ways:
1. Snapshots via nightly ETL
2. Snapshots via request/response APIs
3. Continuous streams
21. Data product: what’s happening inside
Between its input and output data ports, a data product can use whatever technology its team prefers: pick your favorites.
Data on the Inside is HOW the domain team solves specific problems internally. This doesn’t matter to other domains.
22. Event Streaming inside a data product
Use ksqlDB, Kafka Streams apps, etc. for processing data in motion:
1. Stream data from other data products or internal systems into ksqlDB.
2. Use ksqlDB to filter, process, join, aggregate, and analyze.
3. Stream data to internal systems or to the outside. Pull queries can drive a request/response API.
A sketch of these three steps follows below.
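A hedged sketch of steps 1–3; the stream names and fields are invented for illustration.

  -- 1. Stream data published by another data product into ksqlDB.
  CREATE STREAM shipments (shipment_id VARCHAR KEY, region VARCHAR, weight_kg DOUBLE)
    WITH (KAFKA_TOPIC = 'shipments', VALUE_FORMAT = 'AVRO');

  -- 2. Process the data in motion: a continuously updated aggregate per region.
  CREATE TABLE shipments_by_region AS
    SELECT region, COUNT(*) AS shipment_count, SUM(weight_kg) AS total_weight_kg
    FROM shipments
    GROUP BY region;

  -- 3. Serve the result: a pull query returns the current state for one key,
  --    which can back a request/response API.
  SELECT * FROM shipments_by_region WHERE region = 'EMEA';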
23. Event Streaming inside a data product
Use Kafka connectors and CDC to “streamify” classic databases:
1. Stream data from other data products into your local DB via a sink connector.
2. DB client apps work as usual against the database, e.g. MySQL.
3. Stream data to the outside via a source connector, with CDC and e.g. the Outbox Pattern, ksqlDB, etc.
A connector sketch follows below.
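As a hedged sketch, a CDC source connector can be registered directly from ksqlDB. The hostname, credentials, and table names below are invented, and the exact Debezium configuration keys vary by connector version.

  CREATE SOURCE CONNECTOR orders_cdc WITH (
    'connector.class'      = 'io.debezium.connector.mysql.MySqlConnector',
    'database.hostname'    = 'mysql.orders.internal',  -- hypothetical host
    'database.port'        = '3306',
    'database.user'        = 'cdc_user',
    'database.password'    = 'secret',                 -- placeholder; use a secret store
    'database.server.name' = 'orders',                 -- prefix for the CDC topics
    'table.include.list'   = 'orders.outbox'           -- e.g. an Outbox Pattern table
  );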
24. Dealing with data change: schemas & versioning
• Publish evolving streams with back/forward-compatible schemas.
• Publish versioned streams for breaking changes, e.g. V1 (user, product, quantity) alongside V2 (userAnonymized, product, quantity).
• Also, when needed, data can be fully reprocessed by replaying history.
A versioning sketch follows below.
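A hedged ksqlDB sketch of the V1/V2 example; topic names are invented, and MASK() stands in for whatever anonymization the owning team actually applies.

  -- The existing stream with the V1 schema.
  CREATE STREAM orders_v1 (`user` VARCHAR, product VARCHAR, quantity INT)
    WITH (KAFKA_TOPIC = 'orders.v1', VALUE_FORMAT = 'AVRO');

  -- The breaking change ships as a separate, versioned stream: existing
  -- consumers of orders.v1 keep working while new consumers adopt orders.v2.
  CREATE STREAM orders_v2 WITH (KAFKA_TOPIC = 'orders.v2') AS
    SELECT MASK(`user`) AS userAnonymized, product, quantity
    FROM orders_v1;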
28. Recommendations: Data as a First-class Product
1. Data-on-the-Outside is harder to change, but it has more value in a holistic sense.
   a. Use schemas as a contract and to find data.
   b. Handle incompatible schema changes using the Dual Schema Upgrade Window pattern.
2. Get data from the source, not from intermediaries. Think: Demeter’s law applied to data.
   a. Otherwise, ‘slightly corrupt’ data proliferates within the mesh (a “Game of Telephone”).
   b. Event Streaming makes it easy to subscribe to data from authoritative sources.
3. Change data at the source, including error fixes. Don’t “fix data up” locally.
4. Some data sources will be difficult to turn into first-class data products, e.g. batch-based sources that lose event-level data or are not reproducible.
   a. Use Event Streaming plus CDC, the Outbox Pattern, etc. to integrate these into the mesh.
29. The Principles of a Data Mesh (recap)
30. Why Self-service Matters
Trade Surveillance System:
● Data from 13 sources
● Some sources publish events
● Needed both historical and real-time data
● Historical data came from database extracts arranged with each dev team
● Format of events differed from the format of the extracts
● 9 months of effort to get 13 sources into the new system
32. Principle 3: Self-serve Data Platform
Objective: Make domains autonomous in their execution through rapid data provisioning.
Central infrastructure that provides real-time and historical data on demand.
33. Consuming real-time & historical data from the mesh
1) Separate systems for real-time and historical data (Lambda Architecture). Considerations:
   - Difficult to correlate real-time data with historical “snapshot” data
   - Two systems to manage
   - Unlike event streams, snapshots have less granularity
2) One system for real-time and historical data (Kappa Architecture). Considerations:
   - Operational complexity (addressed in Confluent Cloud)
   - Downsides of the immutability of regular streams, e.g. altering or deleting events
   - Storage cost (addressed in Confluent Cloud, and in Apache Kafka with KIP-405)
A Kappa-style consumption sketch follows below.
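A hedged sketch of the Kappa-style approach in ksqlDB; the shipments stream is the hypothetical one from earlier.

  -- Reading from 'earliest' replays the stream's full retained history, after
  -- which the very same query continues with live events as they arrive.
  SET 'auto.offset.reset' = 'earliest';

  -- One push query thus serves both historical and real-time data.
  SELECT * FROM shipments EMIT CHANGES;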
34. What this can look like in practice
[Screenshot: browsing schemas in a self-serve data platform UI]
35. With ksqlDB the data mesh is queryable and decentralized.
Events are the interface to the mesh; with a stream processor such as ksqlDB, a query becomes the interface to the mesh, feeding a destination data port. A sketch follows below.
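As a hedged sketch (assuming the hypothetical orders and shipments streams share an order_id), a consuming domain can materialize exactly the slice of mesh data it needs into its own destination data port:

  CREATE STREAM my_domain_orders WITH (KAFKA_TOPIC = 'my-domain.orders') AS
    SELECT o.order_id, o.customer_id, s.shipment_id
    FROM orders o
    JOIN shipments s WITHIN 24 HOURS ON o.order_id = s.order_id;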
36. The mesh is one logical cluster; a data product can have another.
A data product can run its own Kafka cluster for internal use, separate from the mesh’s logical cluster.
37. The Principles of a Data Mesh (recap)
38. Principle 4: Federated Governance
• Objective: Independent data products can interoperate and create network effects.
• Establish global standards, like governance, that apply to all data products in the mesh.
• Ideally, these global standards and rules are applied automatically by the self-serve data platform.
The key question: what is decided locally by a domain, and what is decided globally (implemented and enforced by the platform)? You must balance decentralization against centralization. There is no silver bullet!
39. Example standard: Identifying customers globally
• Define how data is represented, so you can join and correlate data across different domains.
• Use data contracts, schemas, registries, etc. to implement and enforce such standards.
• Use Event Streaming to retrofit historical data to new requirements and standards.
With a globally shared customer identifier (e.g., customerId=29639) used by every domain, streams can be joined across domains:
SELECT … FROM orders o
  LEFT JOIN shipments s
  ON o.customerId = s.customerId
EMIT CHANGES;
40. Example standard: Detect errors and recover with Streams
• Use strategies like logging, data profiling, data lineage, etc. to detect errors in the mesh.
• Streams are very helpful for detecting errors and identifying cause-effect relationships.
• Streams let you recover from and fix errors: e.g., replay & reprocess historical data.
Bug? Error? Rewind to the start of the stream, then reprocess: event streams give you a powerful time machine. If needed, tell the origin data product to fix the problematic data at the source. A replay sketch follows below.
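A hedged ksqlDB sketch of the replay-and-reprocess recovery; the stream names are illustrative, and the query ID shown is hypothetical (ksqlDB assigns its own IDs).

  -- Stop the buggy persistent query and drop its output.
  TERMINATE CSAS_MY_APP_OUTPUT_1;
  DROP STREAM my_app_output DELETE TOPIC;

  -- Rebuild the output from the start of the immutable input stream, this
  -- time with the corrected logic.
  SET 'auto.offset.reset' = 'earliest';
  CREATE STREAM my_app_output AS
    SELECT order_id, quantity FROM orders WHERE quantity >= 0;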
41. Example standard: Tracking data lineage with Streams
• Lineage must work across domains and data products—and across systems, clouds, and data centers, including on-premise.
• Event streaming is a foundational technology for this.
42. Recommendations: Federated Governance
1. Be pragmatic: Don’t expect governance systems to be perfect.
   a. They are a map that helps you navigate the data landscape of your company.
   b. But there will always be roads that have changed or have not been mapped.
2. Governance is more a process—i.e., an organizational concern—than a technology.
3. Beware of centralized data models, which can become slow to change. Where they must exist, use processes & tooling like GitHub to collaborate and change quickly. Good luck! 🙂
44. Data Mesh Journey
Start here, in order of increasing difficulty to execute:
1. Principle 1: Data should have one owner: the team that creates it.
2. Principle 2: Data is your product: all exposed data should be good data.
3. Principle 3: Get access to any data immediately and painlessly, be it historical or real-time.
Principle 4, Governance (standards, security, lineage, etc.), is a cross-cutting concern along the whole journey.
47. Implement a Data Mesh: Cheat Sheet
- Centralize data in motion: Introduce a central event streaming platform.
- Nominate data owners: Establish firm owners for all key datasets in the organization. Make ownership information broadly accessible.
- Data on demand: Events are either stored in Kafka indefinitely or can be republished by data products on demand.
- Handle schema change: Owners publish schema information to the mesh. Introduce a process for schema-change approval.
- Secure event streams: Access to event streams is permissioned by a central body.
- Connect from any database: Make sink connectors available for all supported database types to ease the provisioning of new output data ports in the mesh.
- Central user interface: Discovery and registration of event streams; searching schemas for data of interest; previewing event streams; requesting access to event streams; data lineage views.
48. developer.confluent.io
• Free courses on all things Kafka and Event Streaming
• 50+ design patterns for Event Streaming
• And more: quickstarts, tutorials, ...