SlideShare a Scribd company logo
1 of 48
Download to read offline
Architect’s Open-Source
Guide for a Data Mesh
Architecture
Lena Hall
Microsoft
Lena Hall
Director at Microsoft
Azure Engineering
ü Architecture
ü Cloud
ü Data
ü ML/AI
lenadroid
Entry Point
How to Move Beyond a Monolithic Data Lake to a Distributed
Data Mesh
https://martinfowler.com/articles/data-monolith-to-mesh.html
Data Mesh Principles and Logical Architecture
https://martinfowler.com/articles/data-mesh-principles.html
Slack for Data-Mesh-Learning
https://launchpass.com/data-mesh-learning
lenadroid
Talk Snapshot
• What is Data Mesh
• When is Data Mesh a Good Idea
• Core Principles and Concepts
• Example: Drone Delivery Service
• Challenges
• OSS and Open Standards
lenadroid
When and Why
Data Mesh
@lenadroid
Data Mesh is Not For Everyone
Challenges Indicating Data Mesh
May Be Considered
@lenadroid
Drone Delivery Service
lenadroid
WHYs
• Ambiguity in Ownership and Responsibility
• Slow Change due to Coupling to Monolithic System
• Data Engineering Resources Bottleneck
lenadroid
Ideas Composing Data Mesh Concept
@lenadroid
Core Ideas
ü Decentralized teams and data ownership
lenadroid
Core Ideas
ü Decentralized teams and data ownership
ü Data Products powered by Domain Driven Design
lenadroid
High-Level View of a Data Product
lenadroid
Core Ideas
ü Decentralized teams and data ownership
ü Data Products powered by Domain Driven Design
ü Self-serve Shared Data Infrastructure
lenadroid
Core Ideas
ü Decentralized teams and data ownership
ü Data Products powered by Domain Driven Design
ü Self-serve Shared Data Infrastructure
ü Global Federated Governance
lenadroid
Drone Delivery Service Data Products
@lenadroid
lenadroid
Core Principles for Data Products
@lenadroid
lenadroid
DISCOVERABLE
Core Principles for Data Products
lenadroid
DISCOVERABLE
SELF-DESCRIBING
Core Principles for Data Products
lenadroid
DISCOVERABLE
SELF-DESCRIBING
ADDRESSABLE
Core Principles for Data Products
lenadroid
DISCOVERABLE
SELF-DESCRIBING
ADDRESSABLE
SECURE
Core Principles for Data Products
lenadroid
DISCOVERABLE
SELF-DESCRIBING
ADDRESSABLE
SECURE
TRUSTWORTHY
Core Principles for Data Products
lenadroid
DISCOVERABLE
SELF-DESCRIBING
ADDRESSABLE
SECURE
TRUSTWORTHY
INTEROPERABLE
Core Principles for Data Products
Input Ports Questions
• Data Source - Where is the data coming from? External dataset or
another data product?
• Data Format - What is the format of the source input?
• Rate of Updates - How frequently does the input need to be updated?
Output Ports Questions
• End-consumers - Who are the end-users of the data product?
• Data purpose - What are they planning to do with the data outputs?
• Data access - Who needs to have access? How do they prefer to access
the data output?
• Data address - How do they prefer to access the data output?
• Data Format - What format of the data do they expect?
Identity and Permission Policies Questions
• Which resources can this data product be allowed to access?
• Which data products or users can read which output ports of this data
product?
• Are all sensitive resources this data product offers protected according
their required privacy standards (e.g. HIPAA, GDPR, PII, CCPA, etc.)
• Is the permissions policy stored and managed in the same package
as the data product?
Data Product Action Questions
• What is the action that needs to happen to produce the outcomes for the
end-users?
• What are the required adjustments, transformations, filters, updates, or
quality improvements to the input data?
Operational Questions
• How can this data product be discovered and how should it be described to
other data products that might want to consume it?
• Which metadata and information should it make available to the end-
users?
• Where and how should data product versioning be managed during
updates to ensure consistency with how the end-users consume it?
• Which SLAs or SLOs does the data product provide?
• Which product success metrics can this data product expose and keep
track of? (adoption, usage, quality)
• Is the automation/resource orchestration logic stored in the same
package?
Other Questions
• Is this product not tightly coupled to any other data source, data product,
or any other resource that makes him not interoperable?
• Does this data product follow the defined global governance standards and
practices defined by the organization?
• Does this data product have any implementation details that could
interfere with its portability?
Cheat Sheet for Planning Data Products lenadroid
Self-Serve Shared Infrastructure
@lenadroid
REAL-TIME DATA
INGESTION
PROCESSING
OBJECT STORAGE
COLUMNAR STORAGE
PROCESSING
COLUMNAR STORAGE
INCOMING
REQUEST
WEB SERVICE
PROCESSING
lenadroid
Types of Workloads Within a Data Product
WEB SERVICE
PROCESSING
lenadroid
It can look like this
Azure Data
Lake
WEB SERVICE
lenadroid
Or, it can look like this
Google
Storage
lenadroid
Self-Serve Shared Infrastructure
SHARED PLATFORM FOR
STREAMING INGESTION
SHARED PLATFORM FOR
RAW DATA STORAGE
SHARED PLATFORM FOR
COLUMNAR DATA STORAGE
SHARED PLATFORM FOR
CONTAINER WORKLOADS
SHARED PLATFORM FOR
CONTINUOUS DELIVERY
SHARED PLATFORM
FOR OBSERVABILITY
AND MORE…
DEPENDING ON THE ORGANIZATION
DATA CATALOGUE
lenadroid
SHARED PLATFORM FOR
CONTAINER WORKLOADS
SHARED PLATFORM FOR
CONTINUOUS DELIVERY
SHARED PLATFORM
FOR OBSERVABILITY
DISCOVERABLE
SELF-DESCRIBING
ADDRESSABLE
SECURE
TRUSTWORTHY
INTEROPERABLE
Data Mesh
SHARED PLATFORM FOR
STREAMING INGESTION
SHARED PLATFORM FOR
RAW DATA STORAGE
SHARED PLATFORM FOR
COLUMNAR DATA STORAGE
DATA CATALOGUE
Wait, What About the OSS Tools for Data Mesh??
@lenadroid
Challenges with Data Mesh
@lenadroid
Challenges
• Cost questions
• Lack of end-to-end examples
• Efforts to shift from centralized architecture to decentralization-
friendly techniques
• Automation required for enabling creating data products
• Underestimating the importance organizational aspects
lenadroid
Considerations for Technology Choices
@lenadroid
Considerations for Technology Choices
• Workload sharing and multi-tenancy
• No-copy data and compute mobility support
• Granularity of access-control
• Richness of automation and extensibility capabilities
• Flexibility and elasticity
• Provider-agnostic/multi-cloud operations support
• Variety of limitations
(quotas, data volume, resource count, etc.)
• Open Standards, Open Protocols, Open-Source Integrations
lenadroid
Examples of Data Mesh-friendly Technologies
@lenadroid
data
Anthos
Azure Arc
Data Catalogue, Data Lineage,
Data Governance
OSS Data Analytics, Data
Processing, Data Querying
Cloud Storage
Open Formats
Data Ingestion, Streaming
Data Orchestration, Workflows
OSS Storage
Products for Data Analytics and
Processing
Data Visualization and BI Tools
Data Experimentation
Cross-Platform Concepts and Tools
Multi and Hybrid Cloud Tools
Amazon S3
Azure Data Lake
Google Storage
Infrastructure Automation
lenadroid
Data Governance Systems
• Metadata
• Data lineage
• Data schemas
• Data relationships
• Data classification
• Data security
• Data catalog
lenadroid
Open Formats
• Open standard
• Atomic updates, serializable isolation, transactions
• Concurrent operations
• Versioning, rollbacks, time-travel
• Schema Evolution
• Scale, Efficiency, Data Volumes
• Compatibility with existing data stores and languages
lenadroid
Data Platforms (Cloud or OSS)
• Separation of storage and compute
• Support for no-copy data sharing
• Bringing compute to data
• Fine-tuned granularity of permissions for access
• Support for automation and resource management
• Open standards and interoperability with other platforms and
tools for governance, visualization, analytics, etc.
lenadroid
Multi-Cloud Infrastructure Management
• Terraform
Open-source infrastructure as code software tool that enables you to safely and
predictably create, change, and improve infrastructure.
• Pulumi
Open-source infrastructure as code SDK that enables you to create, deploy, and
manage infrastructure on any cloud, using your favorite languages.
• Crossplane
Assemble infrastructure from multiple vendors, and expose higher level self-service
APIs for application teams to consume, without having to write any code.
lenadroid
Multi-Cloud Workload Portability
• Azure Arc
Build cloud-native apps anywhere, at scale. Run Azure services in any Kubernetes
environment, whether it’s on-premises, multi-cloud, or at the edge
• Google Athnos
A modern application management platform that provides a consistent development
and operations experience for cloud and on-premises environments
lenadroid
Kubernetes Open-Standard Technologies
• Open Application Model
An open standard for defining cloud native apps.
KubeVella - https://kubevela.io/docs/concepts
• Open Policy Agent
Declarative Policy-as-Code, enables portability, combination with Infra-as-Code.
https://www.openpolicyagent.org/docs/latest
• Service Catalog
Provision managed services and make them available within a Kubernetes cluster.
https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/
NOT AN EXHAUSTIVE LIST
lenadroid
Benefits Brought by Data Mesh
• Data Quality
• Tailored resource and focus allocation
• Organizational cohesion while allowing flexibility
• Reducing complexity
• Democratizing creating value
• Better understanding of value and innovation opportunities
• Empowering a more consistent and fast change
@lenadroid
Important Focus Areas for Technology Providers
• Open Standards, Open Protocols, Open-Source Integrations
• Workload sharing and multi-tenancy
• No-copy data and compute mobility support
• Granularity of access-control
• Richness of automation and extensibility capabilities
• Flexibility and elasticity
• Provider-agnostic/multi-cloud operations support
• Variety of limitations
(quotas, data volume, resource count, etc.)
@lenadroid
Data Mesh will drive better Interoperability, Open
Standards, and Data Quality in the Industry
@lenadroid
Thank you!
Follow lenadroid for more insights

More Related Content

What's hot

Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 

What's hot (20)

Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
adb.pdf
adb.pdfadb.pdf
adb.pdf
 

Similar to Architect’s Open-Source Guide for a Data Mesh Architecture

Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Nathan Bijnens
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Perficient, Inc.
 

Similar to Architect’s Open-Source Guide for a Data Mesh Architecture (20)

Finding Your Ideal Data Architecture: Data Fabric, Data Mesh or Both?
Finding Your Ideal Data Architecture: Data Fabric, Data Mesh or Both?Finding Your Ideal Data Architecture: Data Fabric, Data Mesh or Both?
Finding Your Ideal Data Architecture: Data Fabric, Data Mesh or Both?
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech Proposals
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
AzureDay - Introduction Big Data Analytics.
AzureDay  - Introduction Big Data Analytics.AzureDay  - Introduction Big Data Analytics.
AzureDay - Introduction Big Data Analytics.
 
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Connected development data
Connected development dataConnected development data
Connected development data
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
The Shifting Landscape of Data Integration
The Shifting Landscape of Data IntegrationThe Shifting Landscape of Data Integration
The Shifting Landscape of Data Integration
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 

More from Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 

Recently uploaded

Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 

Architect’s Open-Source Guide for a Data Mesh Architecture

  • 1. Architect’s Open-Source Guide for a Data Mesh Architecture Lena Hall Microsoft
  • 2. Lena Hall Director at Microsoft Azure Engineering ü Architecture ü Cloud ü Data ü ML/AI lenadroid
  • 3. Entry Point How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh https://martinfowler.com/articles/data-monolith-to-mesh.html Data Mesh Principles and Logical Architecture https://martinfowler.com/articles/data-mesh-principles.html Slack for Data-Mesh-Learning https://launchpass.com/data-mesh-learning lenadroid
  • 4. Talk Snapshot • What is Data Mesh • When is Data Mesh a Good Idea • Core Principles and Concepts • Example: Drone Delivery Service • Challenges • OSS and Open Standards lenadroid
  • 5. When and Why Data Mesh @lenadroid
  • 6. Data Mesh is Not For Everyone
  • 7. Challenges Indicating Data Mesh May Be Considered @lenadroid
  • 9. WHYs • Ambiguity in Ownership and Responsibility • Slow Change due to Coupling to Monolithic System • Data Engineering Resources Bottleneck lenadroid
  • 10. Ideas Composing Data Mesh Concept @lenadroid
  • 11. Core Ideas ü Decentralized teams and data ownership lenadroid
  • 12. Core Ideas ü Decentralized teams and data ownership ü Data Products powered by Domain Driven Design lenadroid
  • 13. High-Level View of a Data Product lenadroid
  • 14. Core Ideas ü Decentralized teams and data ownership ü Data Products powered by Domain Driven Design ü Self-serve Shared Data Infrastructure lenadroid
  • 15. Core Ideas ü Decentralized teams and data ownership ü Data Products powered by Domain Driven Design ü Self-serve Shared Data Infrastructure ü Global Federated Governance lenadroid
  • 16. Drone Delivery Service Data Products @lenadroid
  • 18. Core Principles for Data Products @lenadroid
  • 25. Input Ports Questions • Data Source - Where is the data coming from? External dataset or another data product? • Data Format - What is the format of the source input? • Rate of Updates - How frequently does the input need to be updated? Output Ports Questions • End-consumers - Who are the end-users of the data product? • Data purpose - What are they planning to do with the data outputs? • Data access - Who needs to have access? How do they prefer to access the data output? • Data address - How do they prefer to access the data output? • Data Format - What format of the data do they expect? Identity and Permission Policies Questions • Which resources can this data product be allowed to access? • Which data products or users can read which output ports of this data product? • Are all sensitive resources this data product offers protected according their required privacy standards (e.g. HIPAA, GDPR, PII, CCPA, etc.) • Is the permissions policy stored and managed in the same package as the data product? Data Product Action Questions • What is the action that needs to happen to produce the outcomes for the end-users? • What are the required adjustments, transformations, filters, updates, or quality improvements to the input data? Operational Questions • How can this data product be discovered and how should it be described to other data products that might want to consume it? • Which metadata and information should it make available to the end- users? • Where and how should data product versioning be managed during updates to ensure consistency with how the end-users consume it? • Which SLAs or SLOs does the data product provide? • Which product success metrics can this data product expose and keep track of? (adoption, usage, quality) • Is the automation/resource orchestration logic stored in the same package? Other Questions • Is this product not tightly coupled to any other data source, data product, or any other resource that makes him not interoperable? • Does this data product follow the defined global governance standards and practices defined by the organization? • Does this data product have any implementation details that could interfere with its portability? Cheat Sheet for Planning Data Products lenadroid
  • 27. REAL-TIME DATA INGESTION PROCESSING OBJECT STORAGE COLUMNAR STORAGE PROCESSING COLUMNAR STORAGE INCOMING REQUEST WEB SERVICE PROCESSING lenadroid Types of Workloads Within a Data Product
  • 28. WEB SERVICE PROCESSING lenadroid It can look like this Azure Data Lake
  • 29. WEB SERVICE lenadroid Or, it can look like this Google Storage
  • 30. lenadroid Self-Serve Shared Infrastructure SHARED PLATFORM FOR STREAMING INGESTION SHARED PLATFORM FOR RAW DATA STORAGE SHARED PLATFORM FOR COLUMNAR DATA STORAGE SHARED PLATFORM FOR CONTAINER WORKLOADS SHARED PLATFORM FOR CONTINUOUS DELIVERY SHARED PLATFORM FOR OBSERVABILITY AND MORE… DEPENDING ON THE ORGANIZATION DATA CATALOGUE
  • 31. lenadroid SHARED PLATFORM FOR CONTAINER WORKLOADS SHARED PLATFORM FOR CONTINUOUS DELIVERY SHARED PLATFORM FOR OBSERVABILITY DISCOVERABLE SELF-DESCRIBING ADDRESSABLE SECURE TRUSTWORTHY INTEROPERABLE Data Mesh SHARED PLATFORM FOR STREAMING INGESTION SHARED PLATFORM FOR RAW DATA STORAGE SHARED PLATFORM FOR COLUMNAR DATA STORAGE DATA CATALOGUE
  • 32. Wait, What About the OSS Tools for Data Mesh?? @lenadroid
  • 33. Challenges with Data Mesh @lenadroid
  • 34. Challenges • Cost questions • Lack of end-to-end examples • Efforts to shift from centralized architecture to decentralization- friendly techniques • Automation required for enabling creating data products • Underestimating the importance organizational aspects lenadroid
  • 35. Considerations for Technology Choices @lenadroid
  • 36. Considerations for Technology Choices • Workload sharing and multi-tenancy • No-copy data and compute mobility support • Granularity of access-control • Richness of automation and extensibility capabilities • Flexibility and elasticity • Provider-agnostic/multi-cloud operations support • Variety of limitations (quotas, data volume, resource count, etc.) • Open Standards, Open Protocols, Open-Source Integrations lenadroid
  • 37. Examples of Data Mesh-friendly Technologies @lenadroid
  • 38. data Anthos Azure Arc Data Catalogue, Data Lineage, Data Governance OSS Data Analytics, Data Processing, Data Querying Cloud Storage Open Formats Data Ingestion, Streaming Data Orchestration, Workflows OSS Storage Products for Data Analytics and Processing Data Visualization and BI Tools Data Experimentation Cross-Platform Concepts and Tools Multi and Hybrid Cloud Tools Amazon S3 Azure Data Lake Google Storage Infrastructure Automation lenadroid
  • 39. Data Governance Systems • Metadata • Data lineage • Data schemas • Data relationships • Data classification • Data security • Data catalog lenadroid
  • 40. Open Formats • Open standard • Atomic updates, serializable isolation, transactions • Concurrent operations • Versioning, rollbacks, time-travel • Schema Evolution • Scale, Efficiency, Data Volumes • Compatibility with existing data stores and languages lenadroid
  • 41. Data Platforms (Cloud or OSS) • Separation of storage and compute • Support for no-copy data sharing • Bringing compute to data • Fine-tuned granularity of permissions for access • Support for automation and resource management • Open standards and interoperability with other platforms and tools for governance, visualization, analytics, etc. lenadroid
  • 42. Multi-Cloud Infrastructure Management • Terraform Open-source infrastructure as code software tool that enables you to safely and predictably create, change, and improve infrastructure. • Pulumi Open-source infrastructure as code SDK that enables you to create, deploy, and manage infrastructure on any cloud, using your favorite languages. • Crossplane Assemble infrastructure from multiple vendors, and expose higher level self-service APIs for application teams to consume, without having to write any code. lenadroid
  • 43. Multi-Cloud Workload Portability • Azure Arc Build cloud-native apps anywhere, at scale. Run Azure services in any Kubernetes environment, whether it’s on-premises, multi-cloud, or at the edge • Google Athnos A modern application management platform that provides a consistent development and operations experience for cloud and on-premises environments lenadroid
  • 44. Kubernetes Open-Standard Technologies • Open Application Model An open standard for defining cloud native apps. KubeVella - https://kubevela.io/docs/concepts • Open Policy Agent Declarative Policy-as-Code, enables portability, combination with Infra-as-Code. https://www.openpolicyagent.org/docs/latest • Service Catalog Provision managed services and make them available within a Kubernetes cluster. https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/ NOT AN EXHAUSTIVE LIST lenadroid
  • 45. Benefits Brought by Data Mesh • Data Quality • Tailored resource and focus allocation • Organizational cohesion while allowing flexibility • Reducing complexity • Democratizing creating value • Better understanding of value and innovation opportunities • Empowering a more consistent and fast change @lenadroid
  • 46. Important Focus Areas for Technology Providers • Open Standards, Open Protocols, Open-Source Integrations • Workload sharing and multi-tenancy • No-copy data and compute mobility support • Granularity of access-control • Richness of automation and extensibility capabilities • Flexibility and elasticity • Provider-agnostic/multi-cloud operations support • Variety of limitations (quotas, data volume, resource count, etc.) @lenadroid
  • 47. Data Mesh will drive better Interoperability, Open Standards, and Data Quality in the Industry @lenadroid
  • 48. Thank you! Follow lenadroid for more insights