Data Mesh is a relatively new approach that helps companies do more with data, faster. It requires both organizational and technical changes: enabling autonomy and self-service, treating data as a product, and encouraging secure collaboration.
In this session, we will discuss practical approaches you can implement today to help your company start benefiting from Data Mesh. We'll show you how to create autonomy by splitting responsibility between data producers and consumers, how to share datasets, and how to make data discovery easy.
We'll show a demo with producers building an ingestion pipeline that publishes datasets to consumer accounts (data mesh domains). SQL templates will be provided for members to follow along and build on their own.
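The SQL templates shown in the demo were SQLake-specific and provided to attendees. As a rough, hypothetical sketch of the producer-side pattern they illustrate (table and column names here are invented, and SQLite stands in for the lake engine), a producer ingests raw events, curates them, and publishes a clean dataset for consumers:

```python
import sqlite3

# Hypothetical sketch of a producer pipeline (names invented, SQLite as a
# stand-in engine): ingest raw events, curate them, and publish a dataset
# that consumer domains can query.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Raw landing zone, owned by the producer domain.
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount REAL, status TEXT)")
cur.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o1", 10.0, "complete"), ("o2", -5.0, "error"), ("o3", 7.5, "complete")],
)

# 2. Curated, published dataset - the producer filters out bad records so
#    consumers always see clean data.
cur.execute(
    """
    CREATE TABLE published_orders AS
    SELECT order_id, amount
    FROM raw_orders
    WHERE status = 'complete' AND amount >= 0
    """
)

rows = cur.execute(
    "SELECT order_id, amount FROM published_orders ORDER BY order_id"
).fetchall()
print(rows)  # [('o1', 10.0), ('o3', 7.5)]
```

In a real mesh the "publish" step would write to shared storage or a consumer account rather than a local table, but the split between a raw, producer-owned zone and a curated, consumer-facing dataset is the same.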
We'll present these use cases built with data mesh design patterns:
1. A multi-tenant data lake that allows data producers to share datasets with consumers outside of the organization (third parties).
2. A security data lake that allows different teams to publish curated logs to their local Elasticsearch clusters for analysis, and to a central data lake for retention, auditing and historical analysis.
We'll also discuss managing data contracts/schemas between producers and consumers, to enable ownership and better data quality when sharing datasets.
Meetup: https://www.meetup.com/boston-data-engineering/events/291383661/
Video: https://youtu.be/lIcmomYZ3mo
Boston Data Engineering: Designing and Implementing Data Mesh at Your Company with Upsolver
1. Designing and implementing Data Mesh at your company
In partnership with:
Participating meetups in
Boston
NYC
Chicago
Toronto
Montreal
2. Who we are
Roy Hasson - Head of Product @ Upsolver (/in/royhasson/)
- Ex-AWS: product for Amazon Athena, AWS Glue and AWS Lake Formation
- Founding member of AWS Data Lake and Data Mesh initiatives
- Guides and supports Data Mesh implementations with customers
Jason Hall - Sr. Solutions Architect @ Upsolver (/in/jasonfhall/)
- Works with customers to plan and implement data pipeline strategies
- Helps ensure successful data projects from inception to production
3. The challenge: making a big impact, quickly
Business users are saying:
It takes too long to onboard new data
Central IT/data teams are a bottleneck
Can’t find, understand and access data
Takes too long to make small tweaks
Engineering users are saying:
We don’t understand business needs
Too many requests and tweaks
Integrations are complex and fragile
Difficult to hire good data engineers
4. Trying to solve the challenge with existing patterns
- Data Lake - build to suit (https://aws.amazon.com/big-data/what-is-a-data-lake/)
- Lakehouse - decoupled (https://databricks.com/product/data-lakehouse)
- Data Warehouse - hybrid (https://www.snowflake.com/blog/data-cloud-hybrid-data-warehouse-data-lake/)
5. These solutions do not work on their own
Data lake
- Too low level, integrations are manual and complex
- Encourages inconsistent implementations, difficult to secure
- Open and vibrant community
Lakehouse
- Fewer tool options; simpler to implement, but integrations are still manual
- Encourages centralization and lock-in
- Vibrant community in parts of the stack (storage and core engine)
Hybrid DWH
- 3-4 primary vendors to choose from, vertically integrated
- Encourages centralization and lock-in
- Limited by the vendor’s roadmap
6. This is not what we’re talking about
https://future.a16z.com/emerging-architectures-modern-data-infrastructure/
7. …this - Introducing Data Mesh
https://martinfowler.com/articles/data-monolith-to-mesh.html
Flexible organization design aligned to business needs
8. Flexible organization design and self-service tooling
Data domains - Autonomous units with ownership and accountability. Domains can produce and/or consume data with other domains.
Data infrastructure as a platform - Build once, use everywhere. Enables consistent tooling, engineering and security best practices, and ease of integration.
Data as a product - Data assets are treated like products: delivered in a reliable, consistent and secure manner, and easily discoverable and accessible across the org.
Overarching governance - Procedures and guidelines to secure, audit and control the quality of data in the organization.
9. Why Data Mesh at JPMC
Source JPMC July 2021 @ Data Mesh Learning Meetup - https://youtu.be/7iazNKG8XQo
10. High level Data Mesh design @ JPMC
Source AWS @ https://aws.amazon.com/blogs/big-data/how-jpmorgan-chase-built-a-data-mesh-architecture-to-drive-significant-value-to-enhance-their-enterprise-data-platform/
11. A single data domain built on an open data lake architecture
Source JPMC July 2021 @ Data Mesh Learning Meetup - https://youtu.be/7iazNKG8XQo
12. Creating a mesh with multiple data domains
Source JPMC July 2021 @ Data Mesh Learning Meetup - https://youtu.be/7iazNKG8XQo
13. Why Data Mesh at Intuit
Source Intuit July 2021 @ Data Mesh Learning Meetup - https://youtu.be/tNcxoASumB8
14. Intuit Data Mesh data products
Intuit data mesh strategy @ https://medium.com/intuit-engineering/intuits-data-mesh-strategy-778e3edaa017
15. Why Data Mesh at Zalando
Source Zalando @ Spark + AI Summit 2020 - https://youtu.be/eiUhV56uVUc
16. Moving to a Data Mesh at Zalando
Source Zalando @ Spark + AI Summit 2020 - https://youtu.be/eiUhV56uVUc
17. What can we learn from JPMC, Intuit and Zalando
1. Primary drivers - Autonomy, ownership and data-as-a-product
2. Sharing - producer/consumer model
3. Common data infrastructure - improve cost, scale and management overhead
a. JPMC opted to build their own data lake
b. Zalando used Databricks Lakehouse as a base for their platform
c. Intuit created an open platform letting data domains choose
4. Central catalog - unified data asset discoverability, collaboration and entitlements
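As an illustration of what the central catalog in point 4 provides (this is a toy sketch, not any vendor's API; all names are invented), producer domains register data products with ownership metadata, and consumers discover them by tag:

```python
# Toy sketch of a central data catalog (not any vendor's API):
# producer domains register data products; consumers discover them by tag.
catalog = []

def register(domain, name, owner, tags):
    """A producer domain registers a data product with ownership metadata."""
    catalog.append({"domain": domain, "name": name, "owner": owner, "tags": set(tags)})

def discover(tag):
    """A consumer searches the catalog by tag - the discoverability piece."""
    return [p["name"] for p in catalog if tag in p["tags"]]

# Two domains publish products; names are invented for illustration.
register("payments", "published_orders", "payments-team", ["orders", "finance"])
register("security", "curated_logs", "secops-team", ["logs", "audit"])

print(discover("orders"))  # ['published_orders']
print(discover("audit"))   # ['curated_logs']
```

Real implementations (AWS Glue Data Catalog, an internal metadata service, etc.) add entitlements and lineage on top, but the register/discover contract is the core idea.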
18. What to consider when getting started
1. What are the primary outcomes when implementing Data Mesh?
a. Autonomy - eliminating bottlenecks
b. Ownership and accountability - single owner, governance, quality and hygiene of data
c. Sharing - share and collaborate with teams to do more with data
d. Data products and data as code
2. Data infra - build vs. buy
a. Is owning the infra business critical?
b. Do you have the resources? How long will it take to build? How invested will you be two years from now?
c. Can you build some and buy some?
3. What are the most important outputs you need to deliver?
a. Ownership and discoverability = unified catalog
b. Autonomy = producer/consumer, data contracts
c. Data as code = GitOps + dbt/python + data contracts
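One way to make the "data contracts" in points 3b and 3c concrete (a minimal stdlib-only sketch; the schema and field names are invented, and real setups typically use dbt contracts or a schema registry instead) is for the producer to validate every record against the agreed schema before publishing:

```python
# Minimal sketch of a data contract check (field names invented):
# producer and consumer agree on a schema, and the producer validates
# every record against it before publishing to the mesh.
CONTRACT = {"order_id": str, "amount": float, "currency": str}

def validate(record, contract=CONTRACT):
    """Return a list of violations; an empty list means the record conforms."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return violations

good = {"order_id": "o1", "amount": 10.0, "currency": "USD"}
bad = {"order_id": "o2", "amount": "ten"}

print(validate(good))  # []
print(validate(bad))   # ['wrong type for amount: str', 'missing field: currency']
```

Running the check in CI (the GitOps angle) means a producer cannot merge a pipeline change that breaks the contract its consumers depend on.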
19. What to avoid early on
1. Don’t try to solve loosely defined problems
a. What does governance mean to you?
b. What does self-service analytics mean?
2. Don’t expand your scope, reduce it
a. Focus on outputs you need to deliver on your primary business outcomes
3. Don’t overcomplicate your architecture
a. Try to avoid doing everything that seems cool today
b. Build on top of best practices and familiar patterns - simpler to support and find help
c. Avoid vendor and technology lock-in
d. The more you build, the more you need to maintain. Avoid unnecessary tech debt
23. Summary
● Data Mesh is an organizational pattern - get your company on-board
● Identify the primary business outcomes you want to deliver with Data Mesh
● Focus on what you need to build now to deliver on an outcome soon
● Ensure data has clear ownership and accountability (quality, SLA, etc.)
● Treat data as a product
25. Thank you
Join the Upsolver Community
to continue the conversation
upsolver.com
/in/royhasson/
/in/jasonfhall/
26. Actually, there is such a thing as a free lunch…*
Schedule a Demo | Sign Up for SQLake
Last resort… email the sales guy
* $20 DoorDash gift card for everyone who schedules a demo