"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh."
17. Microservices (Synchronous)
● APIs provide dedicated business functions
● Clients couple on REST APIs
● SLAs, change management, responsibilities
● Common platform (Kubernetes, Docker, Service Catalog, etc.)
18. Services can end up looking like weird databases
getOrder(order_id)
getOrder(user_id)
getAllOrders()
getAllPendingOrders()
getAllCanadianOrders()
21.–23. Reuse Existing Strategies (built up across three slides)
Microservices                             | Data Mesh
APIs serve business functions             | Data products serve data
Couple on REST APIs                       | Couple on the data contract
SLAs, change management, responsibilities | SLAs, change management, responsibilities
Common platform and controls              | Common platform and controls
25. Principle 1: Domain Ownership
Objective: Data is owned by those who truly understand it.
Pattern: Data belongs to the team that understands it best (decentralized data ownership).
Anti-pattern: A centralized team owns all data (centralized data ownership).
26. Principle 2: Data as a First-Class Product
Objective: Make shared data discoverable, addressable, trustworthy, and secure, so other teams can make good use of it.
● Data is treated as a true product, not a by-product.
27. Data Product, a “Microservice for the Data World”
● A data product is a node on the data mesh, situated within a domain (e.g., an “Items About to Expire” data product).
● It produces—and possibly consumes—high-quality data within the mesh.
● A data product bundles three parts:
  ○ Data: the data and metadata, including history
  ○ Code: creates, manipulates, serves, etc. that data
  ○ Infra: powers the data (e.g., storage) and the code (e.g., run, deploy, monitor)
28. Principle 3: Self-Serve Data Platform
Objective: Make it easy to both create and use data products.
● Provide discovery, access, and self-service compute and publish tools.
29. Principle 4: Federated Governance
Objective: Standards of interoperability, policies, and support.
● Global standards and data product support: “paved roads”.
● Decide what is set locally (per domain) and what is set globally (implemented and enforced by the self-serve platform).
● Must balance decentralization against centralization. No silver bullet!
30. Data Fabric vs. Data Mesh (not mutually exclusive!)
                     | Data Fabric                     | Data Mesh
Access               | Virtually, via the fabric layer | Data accessed directly
Management           | Centralized                     | Decentralized
Responsibility model | Primarily technical             | Social & technical
Relies on            | Well-formed data sources        | Self-serve & governance
33. Data Ownership is Split Across Multiple Teams
[Diagram: the App Owner's App writes to a Source DB; a Data Engineer's 30-min job copies it into the Data Lake; a daily job builds a Daily Report for The Boss; a Data Scientist also works from the Data Lake.]
34. How would we approach this pattern using Data Mesh?
37. Data on the Outside
[Diagram: a Relational DB is copied outward many ways: a Read-Only Replica, a Cron Job, Ad-Hoc Copies (and a copy of a copy), and a Data Lake, each feeding different Apps.]
38. Data on the Inside
[Diagram: the App Dev's App and its Source DB are data on the inside; the Data Lake, other Apps, and Cloud SaaS consumers sit outside and rely on data on the outside.]
39. In Data Mesh, Ownership is Moved Left
[Diagram: responsibility for the 30-min job that copies the Source DB into the Data Lake shifts from the Data Engineer to the App Owner.]
40. Create a Data Product
[Diagram: the 30-min job now feeds a dedicated Data Product between the App's Source DB and the Data Lake; the App Owner also takes on the Data Product Owner role.]
41. Negotiate the Social Changes
[Diagram: the Data Product Owner works with both current and prospective data product users of the Data Product sourced from the App's Source DB.]
42. Data Product Metadata
● Owner & Domain
● Schema
● Format (or API)
● Location
● Service Level (SLA)
● Restrictions
43. Data Product Metadata (example, published to the Data Lake)
Owner & Domain:      Adam (Engineering)
Schema:              <Parquet Schema>
Format (or API):     Iceberg Table
Location:            S3://bucketname/…
Service Level (SLA): Tier 2
Restrictions:        Top Secret
44. End-User Functionality via the Self-Service Platform
Owner & Domain:      Adam (Engineering)
Schema:              <Parquet Schema>
Format (or API):     Iceberg Table
Location:            S3://bucketname/…
Service Level (SLA): Tier 2
Restrictions:        Top Secret
[Request Access] [Contact Owner]
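The metadata record on these slides can be sketched as a simple typed structure. This is a minimal illustration, not a real platform API: the class name and field names are my own rendering of the slide's fields, with the slide's example values filled in.

```python
from dataclasses import dataclass

# Minimal sketch of a data product metadata record, assuming the six fields
# shown on the slides; names and types are illustrative, not a real API.
@dataclass(frozen=True)
class DataProductMetadata:
    owner_and_domain: str   # e.g., "Adam (Engineering)"
    schema: str             # e.g., a Parquet schema definition
    format_or_api: str      # e.g., "Iceberg Table" or "Kafka Topic"
    location: str           # where the data lives
    service_level: str      # SLA tier
    restrictions: str       # access restrictions

orders_metadata = DataProductMetadata(
    owner_and_domain="Adam (Engineering)",
    schema="<Parquet Schema>",
    format_or_api="Iceberg Table",
    location="S3://bucketname/…",
    service_level="Tier 2",
    restrictions="Top Secret",
)
```

A catalog of such records is enough to bootstrap discovery: the self-service platform can index them and drive "Request Access" / "Contact Owner" actions from the owner and restriction fields.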
45. Can Again Draw on Known Practices
Microservices                            | Data Mesh
1st-class language and framework support | Same
Self-service portals & tools             | Same
Code generators (APIs, clients, servers) | Same
Cataloging and discovery                 | Same
47. How to Start
Start small, keep it simple:
• Spreadsheet of data products
• Ticket system
• Focus on the data
48. Iterate: Review, Revise, and Improve
Pave the roads:
• Make it easy
• Prototype and trial changes
• Share successes
49. Data Fabric on Top of Data Products?
[Diagram: domains sit on the Data Mesh Self-Serve Data Platform, with a Data Fabric Access Layer (API, Governance, Permissions, Lineage) on top.]
53.–55. Growing Operational Data Requirements (built up across three slides)
● Act on data in real time (tactical, operational)
● Decouple services
● Combine separate data sources
● Remodel data
56. Publish Data to Event Streams
[Diagram: a Producer Application publishes records to a Kafka Topic (offsets 0–7).]
57.–58. Consumers Read at Their Own Rate
[Diagram: a Consumer Application reads from the Kafka Topic at its own pace, independent of the Producer Application's publishing.]
59. Consumers Can Reread the Data as Needed
[Diagram: multiple Consumer Applications read from the same Kafka Topic; each can reread the topic as needed.]
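The behavior in slides 56–59 can be simulated with an in-memory append-only log. This is not the real Kafka client API, just a sketch of the key idea: each consumer tracks its own offset, so consumers read, and reread, independently.

```python
# In-memory sketch of an append-only topic with per-consumer offsets.
class Topic:
    def __init__(self):
        self.records = []            # index in this list == offset

    def publish(self, record):
        self.records.append(record)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0              # this consumer's position only

    def poll(self):
        batch = self.topic.records[self.offset:]
        self.offset = len(self.topic.records)
        return batch

    def seek(self, offset):
        self.offset = offset         # rewind to reread history

topic = Topic()
for i in range(4):
    topic.publish(f"order-{i}")

fast, slow = Consumer(topic), Consumer(topic)
assert fast.poll() == ["order-0", "order-1", "order-2", "order-3"]
topic.publish("order-4")
assert fast.poll() == ["order-4"]    # only the new record
fast.seek(0)
assert len(fast.poll()) == 5         # rereads the whole topic
assert len(slow.poll()) == 5         # slow's offset was never affected
```

Because the log is immutable and offsets belong to the consumer, rewinding one consumer never disturbs another, which is what makes the same stream safe for both real-time and catch-up batch readers.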
60. Combine Event Streams with Data Mesh
● Data mesh, for analytical data problems
● Off-label uses: operations! (batch and streaming)
62. Data Product: Parquet Files to S3 (e.g., an Apache Iceberg Table)
[Diagram: the Orders Service DB feeds a 30-min job into a Data Product with a Parquet Files port; a daily job builds the Daily Order Report from the Data Lake for The Boss.]
Too slow! What about real-time?
63.–65. Data Product: Can Use a Kafka Topic with a Schema
[Diagram: the Data Product now exposes two ports: the Parquet Files port and a Kafka Topic (offsets 0–7).]
order_id | total  | items | time
1        | $19.99 | [...] | 186…
2        | $12.99 | [...] | 187…
3        | $24.99 | [...] | 188…
4        | $38.99 | [...] | 189…
Schema:
{
  "order_id": Long,
  "total": Double,
  "items": List[Item],
  "time": Long
}
66. Each Event Represents an Order
{ "order_id": 1, "total": 19.99, "items": [...], "time": 186… }
{ "order_id": 2, "total": 12.99, "items": [...], "time": 187… }
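The schema above can be rendered as a typed record. A hedged sketch: the `Order` class is my stand-in, and since the slide leaves `Item` undefined, items are approximated here as integer item ids.

```python
from dataclasses import dataclass, asdict
from typing import List
import json

# Sketch of the order fact event from the slides; the slide's List[Item] is
# approximated as a list of integer item ids.
@dataclass
class Order:
    order_id: int
    total: float
    items: List[int]
    time: int            # epoch millis, shown truncated as "186…" on the slide

event = Order(order_id=1, total=19.99, items=[521, 923], time=186_000_000_000)
payload = json.dumps(asdict(event))   # the bytes that would go onto the topic
restored = Order(**json.loads(payload))
assert restored == event              # round-trips losslessly
```

Pinning the event to an explicit schema like this is what turns the topic into a data contract: producers can evolve it deliberately, and consumers can deserialize without guessing at field names or types.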
67. Giving Consumers a Choice for Data Access
[Diagram: prospective data product users can choose between the Parquet Files port and the Kafka Topic port.]
68. Select the Option That Works Best for You
[Diagram: Batch-Computed Analytics, a Streaming Operational App, and Streaming Analytics each choose between the Parquet Files port and the Kafka Topic port.]
69. Build the Table from the Stream
[Diagram: the Kafka Topic (offsets 0–7) is materialized into a Parquet-backed table.]
72. @AdamBellemare | developer.confluent.io
Facts model state; deltas model change.

Order fact event (a DTO):
{
  "order_id": 1,
  "items": [ 521, 923 ],
  "total": 19.99,
  "timestamp": 186…
}

Delta events:
item_added_to_order
{
  "order_id": 1,
  "item_id": 521,
  "quantity": 1
}

discount_code_applied
{
  "order_id": 1,
  "discount_code": "SAVE-20-2022",
  "discount_percent": "20"
}
73.–78. Stream-Table Duality with Fact Events (built up across six slides)
Event stream, keyed by order_id:
Time  | Key (order_id) | Value
03:03 | 1 | items: [100], total: $10.00
08:33 | 2 | items: [200], total: $20.00
13:21 | 1 | items: [100, 200], total: $30.00
13:22 | 3 | items: [450, 451], total: $68.50
19:54 | 1 | Null

Upserting each record in sequence yields the current materialized state (the table):
Key (order_id) | Value
1 | Null (deleted)
2 | items: [200], total: $20.00
3 | items: [450, 451], total: $68.50
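The upsert sequence on these slides can be reproduced in a few lines. A minimal sketch: the stream is a list of (time, key, value) tuples, and a `None` value plays the role of the Null tombstone that deletes order 1.

```python
# Materialize the slides' event stream into a table: upsert each keyed record
# in sequence; a None (Null) value deletes the key.
stream = [
    ("03:03", 1, {"items": [100], "total": 10.00}),
    ("08:33", 2, {"items": [200], "total": 20.00}),
    ("13:21", 1, {"items": [100, 200], "total": 30.00}),
    ("13:22", 3, {"items": [450, 451], "total": 68.50}),
    ("19:54", 1, None),                    # tombstone for order 1
]

table = {}
for _time, order_id, value in stream:
    if value is None:
        table.pop(order_id, None)          # delete on tombstone
    else:
        table[order_id] = value            # insert or update (upsert)

assert sorted(table) == [2, 3]             # order 1 was deleted
assert table[3]["total"] == 68.50
```

Because each fact event carries the full current state for its key, any consumer can rebuild an identical table just by replaying the stream from offset 0, which is exactly the duality the slides illustrate.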
79. Deltas Are Used in Event Sourcing (Data on the Inside)
Delta stream, applied in sequence to build state:
Time  | Delta
11:37 | 2 Pants added to order
11:39 | 1 T-Shirt added to order
11:41 | 1 Pants removed from order
11:42 | 15 Hats added to order
11:42 | Apply discount code

Current consumer state (Customer: Robert):
Product  | Quantity
Pants    | 1
T-Shirts | 1
Hats     | 15
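Replaying the slide's delta stream in sequence rebuilds the order state. A sketch under one assumption: the discount-code event is omitted because it changes no quantities.

```python
# Replay the slide's delta events in sequence to build the consumer's state.
deltas = [
    ("add", "Pants", 2),
    ("add", "T-Shirts", 1),
    ("remove", "Pants", 1),
    ("add", "Hats", 15),
]

order = {}
for op, item, qty in deltas:
    change = qty if op == "add" else -qty
    order[item] = order.get(item, 0) + change
    if order[item] <= 0:
        del order[item]                    # drop fully-removed items

assert order == {"Pants": 1, "T-Shirts": 1, "Hats": 15}
```

Note that this replay logic is order-sensitive and domain-specific; every consumer that wants the state must implement it, which is the problem the next slide raises.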
80. Deltas Not Suitable for Building External State (Data on the Outside)
[Diagram: each independent consumer service must apply the same delta stream in sequence to build its own state, duplicating tightly coupled order-building logic across every consumer.]