Have you heard about Data Mesh but never really understood how you actually build one? Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold. Although the data mesh is not a technology-specific pattern, implementing one requires that organizations make choices about, and investments in, specific technologies and operational policies. Establishing "paved roads" for creating, publishing, evolving, deprecating, and discovering data products is essential for bringing the benefits of the mesh to those who would use it.
In this talk, Adam covers implementing a self-service data mesh with events streams in Apache Kafka®. Event streams as a data product are an essential part of a real-world data mesh, as they enable both operational and analytical workloads from a common source of truth. Event streams provide full historical data along with realtime updates, letting each individual data product consumer decide what to consume, how to remodel it, and where to store it to best suit their needs.
Adam structures this talk around a hypothetical SaaS business question: "What is the relationship between feature usage and user retention?" This example explores each team's role in the data mesh, including the data products they would (and wouldn't) publish, how other teams could use those products, and the organizational dynamics and principles underpinning it all.
5. The Four Principles of Data Mesh
1. Domain Ownership: local autonomy (an organizational concern)
2. Data as a First-Class Product: product thinking, a "microservice for data"
3. Self-Serve Data Platform: infra tooling, across domains
4. Federated Governance: interoperability and network effects (an organizational concern)
6. Principle 1: Domain Ownership
Objective: Data is owned by those who truly understand it.
● Pattern: Decentralized data ownership. Data belongs to the team that understands it best.
● Anti-pattern: Centralized data ownership. A centralized team owns all data.
7. Data Lake: Ownership Rests with a Centralized Data Team
[Diagram: Domains Foo and Bar (owned by Alice and Joe) feed connectors into a centralized data domain, where Renata's central team must clean up and remodel the data (e.g., with ksqlDB or Kafka Streams) and repeatedly renegotiate domain ownership.]
8. Data Mesh: Ownership Rests with the Domain
[Diagram: Domains Foo and Bar each clean up and remodel their own data (e.g., with ksqlDB or Kafka Streams) and publish it via connectors on a self-service platform. Alice and Joe retain end-to-end ownership of their domains' data, while Renata provides platform support for the decentralized data.]
9. Principle 2: Data as a First-Class Product
Objective: Make shared data discoverable, addressable, trustworthy, and secure, so other teams can make good use of it.
● Data is treated as a true product, not a by-product.
10. Data Product, a "Microservice for the Data World"
● A data product is a node on the data mesh, situated within a domain.
● It produces, and possibly consumes, high-quality data within the mesh.
● It combines three elements:
○ Data: the data and metadata, including history (e.g., an "Items About to Expire" data product)
○ Code: creates, manipulates, and serves that data
○ Infra: powers the data (e.g., storage) and the code (e.g., run, deploy, monitor)
11. Principle 3: Self-Serve Data Platform
Objective: Make it easy to both create and use data products.
● Provide discovery, access, and self-service compute and publishing tools.
12. Principle 4: Federated Governance
Objective: Standards of interoperability, policies, and support.
● Global standards and data product support: "paved roads".
● What is decided locally by a domain? What is decided globally (implemented and enforced by the self-serve platform)?
● Must balance decentralization vs. centralization. No silver bullet!
14. Data Products: Base Requirements
● Immutable
○ Consumers across time are provided with the same data
● Time-stamped
○ Supports time-bounded queries and operations
● Well-defined (schemas)
○ Clarity as to what the data means
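The three base requirements can be illustrated with a minimal sketch (not from the talk; the `LessonEvent` name and fields are illustrative): a frozen dataclass gives immutability, an explicit timestamp field gives time-stamping, and the declared field types act as the schema.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)  # frozen=True makes instances immutable
class LessonEvent:
    key: str        # e.g., a user id
    lesson_id: str
    status: str     # "Completed" or "Failed"
    timestamp: str  # ISO-8601 event time

event = LessonEvent("UserId-6384291", "AID-2729", "Completed",
                    "2022-04-07T15:19:47Z")
try:
    event.status = "Failed"   # any mutation attempt is rejected
except FrozenInstanceError:
    print("events are immutable")
```

Because the record cannot change after creation, every consumer that reads it, now or years later, sees exactly the same data.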
15. Event Streams Provide an Immutable History
[Diagram: A consumer application reads a data product's stream (offsets 0 through 9). On a bug, an error, or a new aggregate, it rewinds to the start of the stream and reprocesses.]
● Event streams let your consumers replay data as needed (the Kappa architecture).
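The rewind-and-reprocess pattern can be sketched as follows (a toy in-memory log with assumed data, not a real Kafka consumer): because the stream retains every event with its offset, a consumer that shipped buggy logic can reset to offset 0 and rebuild its state correctly.

```python
# Immutable event stream: (offset, user, status)
log = [
    (0, "alice", "Failed"), (1, "alice", "Completed"),
    (2, "joe", "Failed"), (3, "joe", "Failed"),
]

def consume(from_offset, count_status):
    """Replay the log from a given offset, counting matching events per user."""
    counts = {}
    for offset, user, status in log[from_offset:]:
        if status == count_status:
            counts[user] = counts.get(user, 0) + 1
    return counts

buggy = consume(0, "Completed")   # oops: we meant to count failures
fixed = consume(0, "Failed")      # rewind to offset 0, reprocess correctly
print(fixed)                      # {'alice': 1, 'joe': 2}
```

With a mutable store, the buggy run would have destroyed the inputs; with an immutable log, the fix is just another replay.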
16. Event Streams: Massively Scalable
● Store all the data you need, for as long as you need it.
● Cheap disk! Compaction!
● Confluent Cloud's Infinite Storage
● OSS: KIP-405: Kafka Tiered Storage (targeting Kafka 3.3)
17. Events are Well-Defined
[Example event from a stream (offsets 0 through 9):]
Key: String ID-2910312
Value: String itemName = Baseball Bat, String Brand = ACME, String Construction = Wood, Float Price = 29.99
Time: String 2022-04-07T14:51:44Z
● The stream API: Kafka topic + schema, analogous to REST + JSON.
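What "well-defined" buys you can be sketched in a few lines (a simplified dict-based check under assumed field names, standing in for a real schema registry): the schema declares exactly which fields an event carries and their types, so malformed events are rejected before publication.

```python
# Declared schema for the example event above: field name -> required type.
SCHEMA = {"itemName": str, "Brand": str, "Construction": str, "Price": float}

def validate(event: dict) -> bool:
    """True only if the event has exactly the schema's fields, correctly typed."""
    return (event.keys() == SCHEMA.keys()
            and all(isinstance(event[f], t) for f, t in SCHEMA.items()))

good = {"itemName": "Baseball Bat", "Brand": "ACME",
        "Construction": "Wood", "Price": 29.99}
bad = {"itemName": "Baseball Bat", "Brand": "ACME"}  # missing fields
print(validate(good), validate(bad))  # True False
```

Consumers can then rely on every event in the topic conforming, which is exactly what makes the topic-plus-schema pair usable as an API.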
18. Events are Time-Stamped
[The same example event, with its offset and timestamp highlighted:]
Key: String ID-2910312
Value: String itemName = Baseball Bat, String Brand = ACME, String Construction = Wood, Float Price = 29.99
Time: String 2022-04-07T14:51:44Z
● Time-stamped data and incremental offsets enable deterministic reprocessing.
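Deterministic reprocessing can be shown with a small sketch (illustrative events, not from the talk): because each event carries a fixed offset, any two replays of the same range fold into identical state, regardless of the order events arrive in.

```python
# Events with fixed offsets and event-time timestamps (illustrative data).
events = [
    {"offset": 0, "time": "2022-04-07T14:51:44Z", "price": 29.99},
    {"offset": 1, "time": "2022-04-07T14:52:10Z", "price": 31.50},
    {"offset": 2, "time": "2022-04-07T14:53:02Z", "price": 28.00},
]

def reprocess(stream):
    """Fold the stream in offset order into a running total."""
    total = 0.0
    for e in sorted(stream, key=lambda ev: ev["offset"]):
        total += e["price"]
    return round(total, 2)

# Two replays, even over a shuffled copy, yield the same state.
assert reprocess(events) == reprocess(list(reversed(events)))
print(reprocess(events))  # 89.49
```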
19. Event Streams Power Realtime & Batch Processing
[Diagram: One stream of all data (current and historic) feeds streaming operational apps and streaming analytics at millisecond end-to-end latency, and, via connectors, batch-computed analytics and traditional request/response operational apps.]
● Both operational and analytical workloads from a single source!
21. Example: A Learn-a-Language Application
● Lesson types: written, audio, video, stories, flashcards
1. Which lessons do students fail to complete within 24 hours? (Analytical)
2. Can we push them new lessons based on what they've failed? (Operational)
3. Expand the domains to account for paid users. (Both)
22. Masters of Their Domains
● USERS (Alice): user accounts, including private details, PII, and payment info
● SERVING (Joe): serves content to users; collects metrics on users completing and failing lessons
● LESSONS (Maria): lessons, including written, video, audio, and flashcards
(A simplification! A real business could have many more domains.)
23. User Accounts Maintained Within a Single Domain
The USERS domain (Alice) publishes a user-account data product:
key: UserId-6384291
Name: Adam Bellemare
Address: Canada
Email: k2hd9@9fd9s.com
Timestamp: 2022-04-07T15:19:47Z
● The event schema acts as the API and isolates the internal model.
● Format-preserving encryption protects sensitive fields.
25. Content Serving Domain: Source and Aggregate
The SERVING domain (Joe) publishes two data products.
Source-aligned data product (one event per lesson attempt):
key: UserId-6384291
LessonId: AID-2729
Type: Audio
Status: Completed
Aggregate-aligned data product (per-user daily rollup):
key: UserId-6384291
Completed: <List of Lessons>
Failed: <List of Lessons>
StartDate: 2022-02-02 UTC-0
EndDate: 2022-02-03 UTC-0
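How the aggregate-aligned product is derived from the source-aligned one can be sketched as follows (assumed field names matching the example records; a real implementation might use Kafka Streams or ksqlDB): group one day's lesson-status events per user into Completed and Failed lists.

```python
# Source-aligned events: one per lesson attempt (illustrative data).
source_events = [
    {"key": "UserId-6384291", "LessonId": "AID-2729", "Status": "Completed"},
    {"key": "UserId-6384291", "LessonId": "AID-3001", "Status": "Failed"},
    {"key": "UserId-1111111", "LessonId": "AID-2729", "Status": "Failed"},
]

def aggregate(events, start, end):
    """Roll source events up into the per-user aggregate-aligned shape."""
    out = {}
    for e in events:
        agg = out.setdefault(e["key"], {
            "Completed": [], "Failed": [],
            "StartDate": start, "EndDate": end,
        })
        agg[e["Status"]].append(e["LessonId"])
    return out

daily = aggregate(source_events, "2022-02-02", "2022-02-03")
print(daily["UserId-6384291"]["Failed"])  # ['AID-3001']
```

Publishing this rollup alongside the raw events saves every consumer from recomputing the same aggregation.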
26. Lessons Domain: Source-Aligned Data Product
The LESSONS domain (Maria) publishes a source-aligned lessons data product:
key: LessonID-623
assets: S3://….
medium: Written
subject: Verbs
difficulty: Intermediate
27. 1) Which Lessons Do Students Fail to Complete?
Compute 24h course completion and failure rates. Two options:
● Create our own aggregate from SERVING's source-aligned data product:
key: UserId-6384291
LessonId: AID-2729
Type: Audio
Status: Failed
● OR use SERVING's pre-built aggregate-aligned data product:
key: UserId-6384291
Completed: <List of Lessons>
Failed: <List of Lessons>
StartDate: 2022-02-02 UTC-0
EndDate: 2022-02-03 UTC-0
28. 1) Select the Data Products
Join the list of failed lessons from SERVING's aggregate-aligned data product (Joe):
key: UserId-6384291
Completed: <List of Lessons>
Failed: <List of Lessons>
StartDate: 2022-02-02 UTC-0
EndDate: 2022-02-03 UTC-0
with the lesson content from LESSONS' source-aligned data product (Maria):
key: LessonID-623
assets: S3://….
medium: Written
subject: Verbs
difficulty: Intermediate
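The join of the two data products can be sketched in plain Python (assumed shapes mirroring the example records; the talk's actual processor uses ksqlDB): for each user, expand their Failed list and attach the matching lesson content, so analysts can see what kind of lessons students fail.

```python
# Aggregate-aligned product: per-user Failed lists (illustrative data).
failures = {"UserId-6384291": {"Failed": ["LessonID-623"]}}
# Source-aligned lessons product: lesson id -> content attributes.
lessons = {"LessonID-623": {"medium": "Written", "subject": "Verbs",
                            "difficulty": "Intermediate"}}

def join_failed_lessons(failures, lessons):
    """Join each failed lesson id against the lesson-content product."""
    rows = []
    for user, agg in failures.items():
        for lesson_id in agg["Failed"]:
            rows.append({"user": user, "lesson": lesson_id,
                         **lessons[lesson_id]})
    return rows

print(join_failed_lessons(failures, lessons))
```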
29. 1) Create a New Processor and Emit Results
[Diagram: The ANALYTICS domain joins the content serving domain's aggregate-aligned data product (Joe) with the LESSONS data product (Maria) using ksqlDB, and feeds the results into a BI tool.]
30. 1) OR Use Connectors to Integrate with Batch Data
[Diagram: Kafka Connect sinks both the content serving domain's (Joe) and the LESSONS domain's (Maria) data products into cloud storage, where a batch analytics engine computes results for the ANALYTICS domain's BI tool.]
31. 2) Push New Lessons to a User Based on Failures
The operational use case consumes the same source-aligned data products.
Lesson-status event (content serving domain, Joe):
key: UserId-6384291
LessonId: LessonID-623
Type: Audio
Status: Failed
Lesson content (LESSONS domain, Maria):
key: LessonID-623
assets: S3://….
medium: Written
subject: Verbs
difficulty: Intermediate
● User failed a lesson? Find them a new one based on subject, medium, and difficulty.
● User passed a lesson? Offer them a more challenging one.
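The two bullets above can be sketched as a small selection function (the catalog, lesson ids, and the step-down/step-up policy are all assumptions for illustration, not the talk's actual logic): on a failure, pick an easier lesson with the same subject and medium; on a pass, pick a harder one.

```python
LEVELS = ["Beginner", "Intermediate", "Advanced"]
CATALOG = [  # hypothetical lesson-content records
    {"id": "LessonID-623", "subject": "Verbs", "medium": "Written",
     "difficulty": "Intermediate"},
    {"id": "LessonID-801", "subject": "Verbs", "medium": "Written",
     "difficulty": "Beginner"},
    {"id": "LessonID-950", "subject": "Verbs", "medium": "Written",
     "difficulty": "Advanced"},
]

def next_lesson(lesson_id, status):
    """Pick an easier lesson after a failure, a harder one after a pass."""
    cur = next(l for l in CATALOG if l["id"] == lesson_id)
    step = -1 if status == "Failed" else 1
    idx = max(0, min(len(LEVELS) - 1, LEVELS.index(cur["difficulty"]) + step))
    return next(l["id"] for l in CATALOG
                if l["subject"] == cur["subject"]
                and l["medium"] == cur["medium"]
                and l["difficulty"] == LEVELS[idx])

print(next_lesson("LessonID-623", "Failed"))     # LessonID-801
print(next_lesson("LessonID-623", "Completed"))  # LessonID-950
```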
32. 2) Push New Lessons to a User Based on Failures
[Diagram: The content serving domain (Joe) materializes the data products into tables and handles client REST requests, using the same lesson-completion events for both the operational and the analytical use cases.]
33. 3) Expanding the Business Domain: Premium Content
Add special content that is only available to premium (paid) users:
a) Evolve the user event (USERS, Alice) to contain a status enum (Premium / Normal):
key: UserId-6384291
Name: Adam Bellemare
Address: Canada
Email: k2hd9@9fd9s.com
Status: Premium
b) Add new content (LESSONS, Maria) that is only available to premium users:
key: LessonID-623
Assets: S3://….
Medium: Written
Subject: Verbs
Difficulty: Intermediate
Status: Premium
c) Governance requirement: a standard definition of "premium" across the whole business.
34. 3) Expanding the Business Domain: Premium Content
[Diagram: The SERVING domain consumes the evolved USERS (Alice) and LESSONS (Maria) data products, materializes them into tables, and updates its business logic so that client REST requests from paid users are shown the premium content.]
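The updated serving logic can be sketched as follows (record shapes are assumptions based on the example events; a real service would query its materialized tables): both the user and the lesson now carry the governed Status enum, and premium lessons are served only to premium users.

```python
def visible_lessons(user, lessons):
    """Return the lesson keys this user may see, given the shared Status enum."""
    return [l["key"] for l in lessons
            if l["Status"] != "Premium" or user["Status"] == "Premium"]

lessons = [  # hypothetical materialized lesson rows
    {"key": "LessonID-623", "Status": "Premium"},
    {"key": "LessonID-100", "Status": "Normal"},
]
print(visible_lessons({"Status": "Premium"}, lessons))  # both lessons
print(visible_lessons({"Status": "Normal"}, lessons))   # ['LessonID-100']
```

Because "premium" is defined once, globally, by federated governance, every domain gates on the same enum instead of inventing its own flag.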
35. 3) Build New Analytics off Premium
[Diagram: The ANALYTICS domain joins the content serving domain's lesson-status events (Joe) with the premium lesson content (Maria) in ksqlDB, surfacing the results in a BI tool. Both are source-aligned data products.]