Data Mesh: What is it, who is it for, and who is it definitely not for?
What are its foundational principles, and how could we bring some of them into our current analytical data architectures?
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Parts 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products; previously, Jeff was an independent architect for the US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and “Adaptive Information,” a frequent keynote speaker at industry conferences, an author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
Modernizing to a Cloud Data Architecture - Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their experience of a successful migration of their data and workloads to the cloud.
Presentation on Data Mesh: the paradigm shift is a new type of ecosystem architecture, a shift toward a modern distributed architecture that allows domain-specific data, views “data as a product,” and enables each domain to handle its own data pipelines.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you, or should you, consider data mesh as the approach for your analytics platform? And most importantly: how can Snowflake help?
Given in Montreal on 14-Dec-2021
Architect’s Open-Source Guide for a Data Mesh Architecture - Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted at architects, decision-makers, data engineers, and system designers.
Data Architecture Strategies: Data Architecture for Digital Transformation - DATAVERSITY
MDM, data quality, data architecture, and more: combining these foundational data management approaches with other innovative techniques can help drive organizational change as well as technological transformation. This webinar will provide practical steps for creating a data foundation for effective digital transformation.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Building a Logical Data Fabric using Data Virtualization (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3FF1ubd
In the recent Building the Unified Data Warehouse and Data Lake report by leading industry analyst firm TDWI, 64% of organizations stated that the objective of a unified data warehouse and data lake is to get more business value, and 84% of organizations polled felt that a unified approach to data warehouses and data lakes was either extremely or moderately important.
In this session, you will learn how your organization can apply a logical data fabric, together with the associated technologies of machine learning, artificial intelligence, and data virtualization, to reduce time to value and thereby increase the overall business value of your data assets.
KEY TAKEAWAYS:
- How a Logical Data Fabric is the right approach to assist organizations to unify their data.
- The advanced features of a Logical Data Fabric that assist with the democratization of data, providing an agile and governed approach to business analytics and data science.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines - Eric Kavanagh
Synthesis Webcast with Eric Kavanagh and Tamr
DataOps is an emerging set of practices, processes, and technologies for building and automating data pipelines to meet business needs quickly. As these pipelines become more complex and development teams grow in size, organizations need better collaboration and development processes to govern the flow of data and code from one step of the data lifecycle to the next – from data ingestion and transformation to analysis and reporting.
DataOps is not something that can be implemented all at once or in a short period of time. DataOps is a journey that requires a cultural shift. DataOps teams continuously search for new ways to cut waste, streamline steps, automate processes, increase output, and get it right the first time. The goal is to increase agility and shorten cycle times, while reducing data defects, giving developers and business users greater confidence in data analytic output.
This webcast examines how organizations adopt DataOps practices in the field. It will review results of an Eckerson Group survey that sheds light on the rate and scope of DataOps adoption. It will also describe case studies of organizations that have successfully implemented DataOps practices, the challenges they have encountered and benefits they’ve received.
Tune into our webcast to learn:
- User perceptions of DataOps
- The rate of DataOps adoption by industry and other demographic variables
- DataOps adoption by technique and component (e.g., agile, test automation, orchestration, continuous integration/continuous delivery); a small test-automation sketch follows this list
- Key challenges organizations face with DataOps
- Key benefits organizations experience with DataOps
- Best practices in doing DataOps
- Case studies and anecdotes of DataOps at companies
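To make the test-automation component above concrete, here is a minimal sketch of the kind of automated data check a DataOps pipeline might run before promoting a batch. It assumes pandas; the table and the rules are purely illustrative, not taken from the webcast.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list:
    """Return a list of data-quality failures; an empty list means the batch passes."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        failures.append("negative amounts")
    if df["customer_id"].isna().any():
        failures.append("missing customer_id")
    return failures

# Gate a (made-up) batch before it flows downstream.
batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": ["a", None, "c"],
    "amount": [10.0, -5.0, 3.5],
})
problems = validate_orders(batch)
if problems:
    raise ValueError(f"Batch rejected: {problems}")
```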
Intuit's Data Mesh - Data Mesh Learning Community meetup 5.13.2021 - Tristan Baker
Past, present and future of data mesh at Intuit. This deck describes a vision and strategy for improving data worker productivity through a Data Mesh approach to organizing data and holding data producers accountable. Delivered at the inaugural Data Mesh Learning meetup on 5/13/2021.
Data Lake Architecture – Modern Strategies & Approaches - DATAVERSITY
Data Lake or Data Swamp? By now, we’ve likely all heard the comparison. Data Lake architectures have the opportunity to provide the ability to integrate vast amounts of disparate data across the organization for strategic business analytic value. But without a proper architecture and metadata management strategy in place, a Data Lake can quickly devolve into a swamp of information that is difficult to understand. This webinar will offer practical strategies to architect and manage your Data Lake in a way that optimizes its success.
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes... - Dr. Arif Wider
A talk presented by Max Schultze from Zalando and Arif Wider from ThoughtWorks at NDC Oslo 2020.
Abstract:
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando - Europe’s biggest online fashion retailer - we realised that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge - the data owners - while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been coined a Data Mesh.
The Data Mesh paradigm promotes the concept of Data Products which go beyond sharing of files and towards guarantees of quality and acknowledgement of data ownership.
This talk will take you on a journey of how we went from a centralized Data Lake to embrace a distributed Data Mesh architecture and will outline the ongoing efforts to make creation of data products as simple as applying a template.
Data Warehouse Design and Best Practices - Ivo Andreev
A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced and flexible architecture that is capable of meeting both present and long-term future needs. This session covers a comparison of the main data warehouse architectures together with best practices for the logical and physical design that support staging, load and querying.
An introduction to data mesh and the motivations behind it: the failure modes of previous big data management paradigms. Zhamak Dehghani's proposal compares and contrasts data mesh with existing approaches to big data management, presenting the technical components that underpin the software architecture.
Data Architecture, Solution Architecture, Platform Architecture — What’s the ... - DATAVERSITY
A solid data architecture is critical to the success of any data initiative. But what is meant by “data architecture”? Throughout the industry, there are many different “flavors” of data architecture, each with its own unique value and use cases for describing key aspects of the data landscape. Join this webinar to demystify the various architecture styles and understand how they can add value to your organization.
A work by Zhamak Dehghani, Principal Consultant, ThoughtWorks
https://martinfowler.com/articles/data-monolith-to-mesh.html
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift from the centralized paradigm of a lake, or its predecessor, the data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (a minimal sketch follows this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
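As a rough illustration of the capabilities listed above, here is a minimal sketch of a Spark-based null-ratio check; it is not Zillow's actual platform, and the column names and thresholds are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("dq-sketch").getOrCreate()

# An illustrative dataset standing in for a production table.
df = spark.createDataFrame(
    [(1, "seattle", 520000.0), (2, None, 730000.0), (3, "denver", None)],
    ["listing_id", "city", "price"],
)

# Producer-defined expectations: maximum allowed null ratio per column.
expectations = {"city": 0.10, "price": 0.10}

total = df.count()
for column, max_null_ratio in expectations.items():
    nulls = df.filter(F.col(column).isNull()).count()
    ratio = nulls / total
    status = "PASS" if ratio <= max_null_ratio else "FAIL"
    print(f"{column}: null ratio {ratio:.2f} ({status})")
```

In a real platform these expectations would come from something like the self-service onboarding portal described above, with failing data flagged before downstream use.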
Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of accidentally leaving your clusters running and receiving a huge bill ;)
After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters.
This is part 1 of an 8 part Data Science for Dummies series:
Databricks for dummies
Titanic survival prediction with Databricks + Python + Spark ML
Titanic with Azure Machine Learning Studio
Titanic with Databricks + Azure Machine Learning Service
Titanic with Databricks + MLS + AutoML
Titanic with Databricks + MLFlow
Titanic with DataRobot
Deployment, DevOps/MLops and Operationalization
Product-thinking is making a big impact in the data world with the rise of Data Products, Data Product Managers, data mesh, and treating “Data as a Product.” But Honest, No-BS: What is a Data Product? And what key questions should we ask ourselves while developing them? Tim Gasper (VP of Product, data.world), will walk through the Data Product ABCs as a way to make treating data as a product way simpler: Accountability, Boundaries, Contracts and Expectations, Downstream Consumers, and Explicit Knowledge.
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy... - Databricks
A traditional data team has roles including data engineer, data scientist, and data analyst. However, many organizations are finding success by integrating a new role – the analytics engineer. The analytics engineer develops a code-based data infrastructure that can serve both analytics and data science teams. He or she develops re-usable data models using the software engineering practices of version control and unit testing, and provides the critical domain expertise that ensures that data products are relevant and insightful. In this talk we’ll cover the role and skill set of the analytics engineer, and discuss how dbt, an open source programming environment, empowers anyone with a SQL skill set to fulfill this new role on the data team. We’ll demonstrate how to use dbt to build version-controlled data models on top of Delta Lake, test both the code and our assumptions about the underlying data, and orchestrate complete data pipelines on Apache Spark™.
Scaling and Modernizing Data Platform with Databricks - Databricks
Today a Data Platform is expected to process and analyze a multitude of sources spanning batch files, streaming sources, backend databases, REST APIs, and more. There is clearly a need for a standardized platform that scales and stays flexible, letting data engineers and data scientists focus on the business problems rather than managing the infrastructure and backend services. Another key aspect of the platform is multi-tenancy, to isolate workloads and track cost usage per tenant.
In this talk, Richa Singhal and Esha Shah will cover how to build a scalable Data Platform using Databricks and deploy your data pipelines effectively while managing the costs. The following topics will be covered:
Key tenets of a Data Platform
Setup multistage environment on Databricks
Build data pipelines locally and test on Databricks cluster
CI/CD for data pipelines with Databricks
Orchestrating pipelines using Apache Airflow – Change Data Capture using Databricks Delta (a minimal Airflow sketch follows this list)
Leveraging Databricks Notebooks for Analytics and Data Science teams
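To make the orchestration topic concrete, here is a minimal sketch of a two-step Airflow DAG sequencing an ingest-then-transform pipeline. The task bodies are placeholders and the DAG name is hypothetical; the talk's actual jobs are not described here.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder: land change data in a staging location.
    print("ingesting change data")

def transform():
    # Placeholder: merge staged changes into an analytics table.
    print("transforming staged data")

with DAG(
    dag_id="example_cdc_pipeline",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    ingest_task >> transform_task  # transform runs only after ingest succeeds
```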
Agile BI Development Through Automation - Manta Tools
How can code life cycle automation satisfy the growing demands in modern enterprise business intelligence?
Whilst an agile approach to BI development is useful for delivering value in general, the use of advanced automation techniques can also save significant resources, prevent production errors, and shorten time to market.
Speakers from Data To Value, Manta Tools, Volkswagen and M&G Investments presented and discussed different approaches to agile BI development. Take a look!
"Implementing Data Mesh: Six Ways That Can Improve the Odds of Your Success" is a whitepaper authored by Ranganath Ramakrishna of LTIMindtree. The whitepaper introduces the concept of Data Mesh, a socio-technical paradigm that aims to help organizations fully leverage the value of their analytical data.
When writing this new paper, my main objective was to provide a clear understanding of where the term "Big Data" comes from, why the term is so popular now, what it really means, and what its implications for businesses can be. Because the full power of Big Data can be revealed only through analytics, I provide a description of widely recognized analytical techniques to help you figure out how, used in conjunction with Big Data, analytics can boost business performance.
I expect that by the end of this paper:
- you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)
- you will understand the technologies that operate behind the scenes when one talks about "Big Data"
- you will know how to "make sense" of Big Data using analytics
- you will get a basic idea of data mining techniques used in business in general and with Big Data in particular
- you will be able to follow the latest news about Big Data
Data Product Management by Tinder Group PM - Product School
Main Takeaways:
- What is Data Product Management
- Who is a Data Product Manager and what do they do
- How and where to get started to get a role in Data Product Management
Information is at the heart of all architecture disciplines & why Conceptual ... - Christopher Bradley
Information is at the heart of all of the architecture disciplines, such as Business Architecture and Applications Architecture, and conceptual data modelling helps with this. Also, data modelling, which helps inform this, has been wrongly taught in many universities as being just for database design.
chris.bradley@dmadvisors.co.uk
Make compliance fulfillment count double - Dirk Ortloff
This whitepaper gives an overview of the requirements and approaches needed to make your compliance initiative count double: not only fulfilling compliance, but going the next step and bringing your documentation and knowledge handling to a stage where future projects can learn from previous successes and mistakes. This will make your R&D department ready for future challenges, faster markets and global partnerships.
Library systems are no longer ‘stand-alone’. Global technology influences are driving the market more than ever. There is a risk that the solutions libraries provide remain detached from truly meeting the real needs of many users - staff, academics, researchers and students.
Instead of library systems, or even ‘next generation’ library services platforms, we need to think in terms of the wider library technology ‘ecosystem’. That changes how we make our decisions about the products we buy and the services libraries deliver.
Smarter Analytics: Supporting the Enterprise with Automation - Inside Analysis
The Briefing Room with Barry Devlin and WhereScape
Live Webcast on June 10, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=5230c31ab287778c73b56002bc2c51a
The data warehouse is intended to support analysis by making the right data available to the right people in a timely fashion. But conditions change all the time, and when data doesn’t keep up with the business, analysts quickly turn to workarounds. This leads to ungoverned and largely un-managed side projects, which trade short-term wins for long-term trouble. One way to keep everyone happy is by creating an integrated environment that pulls data from all sources, and is capable of automating both the model development and delivery of analyst-ready data.
Register for this episode of The Briefing Room to hear data warehousing pioneer and Analyst Barry Devlin as he explains the critical components of a successful data warehouse environment, and how traditional approaches must be augmented to keep up with the times. He’ll be briefed by WhereScape CEO Michael Whitehead, who will showcase his company’s data warehousing automation solutions. He’ll discuss how a fast, well-managed and automated infrastructure is the key to empowering faster, smarter, repeatable decision making.
Visit InsideAnalysis.com for more information.
Agile, Automated, Aware: How to Model for Success - Inside Analysis
The Briefing Room with David Loshin and Embarcadero
Live Webcast October 27, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=eea9877b71c653c499c809c5693eae8fe
Data management teams face some tough challenges these days. Organizations need business-driven visibility that enables understanding and awareness of enterprise data assets – without worrying about definitions and change management. But with information architectures evolving into a hybrid mix of data objects and data services built over relational databases as well as big data stores, serving up accurately defined, reusable data can become a complex issue.
Register for this episode of The Briefing Room to learn from veteran Analyst David Loshin as he explains the importance of agile, automated workflows in today’s enterprise. He’ll be briefed by Ron Huizenga of Embarcadero, who will discuss how his company’s ER/Studio suite approaches data modeling and management from a modern architecture standpoint. He will explain that unifying the way information is represented can not only eliminate the need for costly workarounds, but also foster collaboration between data architects, developers and business users.
Visit InsideAnalysis.com for more information.
Boston Data Engineering: Designing and Implementing Data Mesh at Your Company... - Boston Data Engineering
Data Mesh is a fairly new approach to help companies do more with data, faster. It requires both organizational and technical changes to enable autonomy and self-service, treat data as a product and encourage secure collaboration.
In this session, we will discuss practical approaches you can implement today to help your company start benefiting from Data Mesh. We'll show you how to create autonomy by splitting responsibility between data producers and consumers, share datasets and make data discovery easy.
We'll show a demo with producers building an ingestion pipeline that publishes datasets to consumer accounts (data mesh domains). SQL templates will be provided for members to follow along and build on their own.
We'll present these use cases built with data mesh design patterns:
1. A multi-tenant data lake that allows data producers to share datasets with consumers outside of the organization (3rd parties).
2. A security data lake that allows different teams to publish curated logs to their local Elasticsearch clusters for analysis, and to a central data lake for retention, auditing and historical analysis.
We'll also discuss managing data contracts/schemas between producers and consumers, to enable ownership and better data quality when sharing datasets.
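As a rough sketch of the producer-consumer contract idea described above (not the session's actual SQL templates), a published schema could be validated in plain Python like this; the contract fields are invented for the example.

```python
# An illustrative data contract: the producer publishes this schema,
# and consumers validate incoming records against it.
CONTRACT = {
    "event_id": str,
    "occurred_at": str,  # ISO-8601 timestamp
    "amount": float,
}

def conforms(record: dict) -> bool:
    """Check that a record has exactly the contracted fields and types."""
    if set(record) != set(CONTRACT):
        return False
    return all(isinstance(record[name], kind) for name, kind in CONTRACT.items())

good = {"event_id": "e1", "occurred_at": "2023-03-01T12:00:00Z", "amount": 9.99}
bad = {"event_id": "e2", "amount": "9.99"}  # missing field, wrong type
print(conforms(good), conforms(bad))  # True False
```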
Meetup: https://www.meetup.com/boston-data-engineering/events/291383661/
Video: https://youtu.be/lIcmomYZ3mo
Business in the Driver’s Seat – An Improved Model for Integration - Inside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on September 30, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=bfff40f7c9645fc398770ea11152b148
The fueling of information systems will always require some effort, but a confluence of innovations is fundamentally changing how quickly and accurately it can be done. Gone are long cycle times for development. Today, organizations can embrace a more rapid and collaborative approach for building analytical applications and data warehouses. The key is to have business experts working hand-in-hand with data professionals as the solutions take shape, thus expediting the speed to valuable insights.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the changing nature of information design. He’ll be briefed by WhereScape President Mark Budzinski, who will discuss his company’s data warehouse automation solutions and how they enable collaborative development. He will share use cases that illustrate how, by aligning business and IT, organizations can enable faster and more agile data warehouse development.
Visit InsideAnalysis.com for more information.
Data. It keeps coming up time and time again: on our social media feeds, in our client conversations, and of course as the driver behind never-before-seen tools like ChatGPT.
But how can you do more with the data your organisation has and produces? What is data engineering and big data, and how can you enable data-driven decision-making within your organisation?
Hear from Nabi Rezvani—Lead Data Engineer—and Gaurav Thadani—Lead Software Engineer at DiUS on the latest trends, use cases and real-life examples of how our clients are using data and analytics to improve their decision making, customer experiences and business operations.
Also joining us are Jonathan Gomez—Head of Data Platforms at Wesfarmers OneDigital OnePass—and John Sullivan—CEO at ChargeFox—on their own [big and small] data journeys, along with the lessons they’ve learned along the way.
Watch the presentation on YouTube: https://youtu.be/ccghOfcdGN8
This presentation takes a look at the architectural constructs used for building business intelligence systems, and at how they are used in business processes to improve marketing, better serve customers, and maximize organizational efficiency.
Data Integration is a key part of many of today’s data management challenges: from data warehousing, to MDM, to mergers & acquisitions. Issues can arise not only in trying to align technical formats from various databases and legacy systems, but in trying to achieve common business definitions and rules.
Join this webinar to see how a data model can help with both of these challenges – from ‘bottom-up’ technical integration, to the ‘top-down’ business alignment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
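For readers new to the topic, here is a minimal sketch of the standard (monolithic) PageRank baseline by power iteration in plain Python; it is not the report's Levelwise implementation, and it handles dead ends by spreading their rank uniformly.

```python
def pagerank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """graph: dict mapping each vertex to a list of outgoing neighbors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dead ends (no out-links) is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for u in graph[v]:
                    new[u] += share
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```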
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha) (a worked sketch follows this list)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
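As a worked illustration of the reliability analysis above, here is a small Python sketch computing Cronbach's alpha from a respondents-by-items score matrix; the data are made up for the example.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: a (respondents x items) matrix of scale item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative 5-respondent, 3-item Likert data.
data = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 3],
    [4, 4, 5],
])
print(round(cronbach_alpha(data), 3))
```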
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation (a small CSR sketch follows these notes).
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
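Since the notes revolve around CSR, here is a minimal sketch of building a CSR representation from an edge list in plain Python; the variable names are illustrative and this is not the report's code.

```python
def to_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from (src, dst) edge pairs."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

offsets, targets = to_csr(3, [(0, 1), (0, 2), (1, 2)])
# Neighbors of vertex v are targets[offsets[v]:offsets[v + 1]].
print(offsets, targets)  # [0, 2, 3, 3] [1, 2, 2]
```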
No, it won’t ‘save us’….
There is NO quick fix to become ‘Data Driven’, or ‘information supported’, or whatever you want to call what we are doing here…
But there are Data Mesh aspects that make sense!
…if you ignore most of the (tech) vendor washing…
As in…
About me...
Rogier Werschkull
21 years ‘in the field’ of data, data warehousing and business intelligence
Data architecture advice, data modeling, data engineering, data-analytics product owner
Blogger, trainer, conference speaker
Contact details:
www.linkedin.com/in/rogierwerschkull/
rogier@rogerdata.nl
@rwerschkull
“Every single company I've worked at and talked to has the same problem without a single exception so far: poor data quality... Either there's incomplete data, missing data, duplicative data.”
Ruslan Belkin, former VP of Engineering @ Twitter and Salesforce
What is Data Mesh? (1)
Data mesh tries (again?) to combat the ‘Analytics Misery’ we tend to create…
HOW? By decentralizing most ‘data warehousing concerns’ to individual business domains. As in:
1. In either the operational source system
2. Or in a decentralized DWH team that sits ‘closer’ to the source systems
And by ‘calling out’ the required organizational / cultural change to accomplish this…
What is Data Mesh? (2)
Data mesh is described in the official Data Mesh book by Zhamak Dehghani (Thoughtworks) as follows:
‘Data mesh is a decentralized sociotechnical approach to share, access, and manage analytical data in complex and large-scale environments—within or across organizations’
For WHO and WHEN?
Only for large organizations, as in:
- Having a lot of source systems
- Having a lot of employees
- Where there are clear / separated business domains
Only for large organizations that…
- are not afraid to experiment, and can live with the current absence of viable implementation patterns
- have a mature (centralized?) data / analytics department
- preferably can influence the design / development of the operational applications they use
Data mesh is foremost about people and processes:
- About changing the data-analytics ‘culture’
- About the process of collaborating on creating decentralized, valuable data products within a ‘business domain’, and sharing data between these domains
NOT about technology!
(Slide diagram: People / Process / Data / Technology)
It’s not that I don’t agree with the core data mesh principles; it’s the way they are explained: ‘quite academic’.
Implementation guidelines are still missing
- Which Zhamak also clearly mentions: it is still an emerging concept!
But even then, some vital context is truly missing.
A major issue I have with the book (that we need to counter): I really have doubts about the amount of data integration experience the contributors have, given that
- it states that folks building DWHs are still striving for the ‘single version of the truth’ (if you lived 10 years ago, yes…)
- there is no mention at all of modern ELM-based data modeling patterns (that are there to help data integration)
Quite often the book’s content does not help in this respect…
18.
It’s about doing this analytical work with data, somehow, somewhere. Data warehousing addresses four ‘concerns’ (a minimal sketch of the last two follows after this slide):
Structure information so it can be consumed easily, shaped for diverse types of users, use cases and tools (Subject Oriented)
Reliable, durable integration / unification of data (Integrated)
Register the history of data and the history of changes to it (Time-Variant)
Store data you receive once, protected from ungoverned deletion (Non-Volatile)
DWH
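To make the last two concerns concrete: a minimal, illustrative Python sketch of an insert-only historical staging area. The class and method names are my own invention; a real implementation would live in a database table, not in memory.

```python
from datetime import datetime, timezone

class HistoricalStagingArea:
    """Append-only store: time-variant (every version kept with a load
    timestamp) and non-volatile (nothing is ever updated or deleted)."""

    def __init__(self):
        self._rows = []  # stands in for an insert-only database table

    def load(self, record_source: str, records: list[dict]) -> None:
        load_ts = datetime.now(timezone.utc)
        for record in records:
            # Store each record RAW, exactly as received, plus load metadata
            self._rows.append({
                "record_source": record_source,
                "load_ts": load_ts,
                "payload": dict(record),
            })

    def history(self, key_field: str, key_value) -> list[dict]:
        # Full history of changes for one business key, oldest first
        return sorted(
            (r for r in self._rows if r["payload"].get(key_field) == key_value),
            key=lambda r: r["load_ts"],
        )

hsa = HistoricalStagingArea()
hsa.load("crm", [{"customer_id": 42, "email": "old@example.com"}])
hsa.load("crm", [{"customer_id": 42, "email": "new@example.com"}])
print(len(hsa.history("customer_id", 42)))  # -> 2: both versions retained
```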
19.
Common DWH patterns overview
For each data-analytics architecture: what is it, its architecture solution features, its most common development style, and how it covers the four data warehousing ‘concerns’ (Subject Oriented / Integrated / Time-Variant / Non-Volatile):

Data lake
• What is it: a repository of raw data of any type for analytics purposes
• Architecture solution features: decentralized tech; on-premise or cloud
• Most common development style: decentralized ‘puddles of lake’
• Subject Oriented: no
• Integrated: no
• Time-Variant: yes, in general implemented as a file store
• Non-Volatile: yes, in general implemented as a file store

DWH 3.0 / ‘Lakehouse’
• What is it: the re-merger of data lake and ‘classical’ DWH concerns, also known as the ‘Modern DWH’
• Architecture solution features: centralized tech; cloud based
• Most common development style: centralized or decentralized, depending on business complexity
• Subject Oriented: yes, via database transformation rules
• Integrated: yes, via database integration rules
• Time-Variant: yes, via a database Historical Staging Area
• Non-Volatile: yes, via a database Historical Staging Area

Data Mesh
• What is it: distributed data architecture that pushes down ‘DWH concerns’ to the source / ‘business domain’
• Architecture solution features: highly decentralized tech; on-premise or cloud
• Most common development style: highly decentralized by definition; focus on ‘data as a product’
• Subject Oriented: yes, but mainly pushed to the ‘business domains’
• Integrated: yes, local within the business domain and centralized via a ‘knowledge graph’-like ‘mapping’
• Time-Variant: yes, but pushed to the ‘business domains’
• Non-Volatile: yes, but pushed to the ‘business domains’

Data Fabric
• What is it: distributed data architecture where ‘time-variant / non-volatile’ concerns are pushed down to the source systems
• Architecture solution features: centralized tech; on-premise or cloud; sources decentralized
• Most common development style: centralized or decentralized, depending on business complexity
• Subject Oriented: yes, via centralized virtual transformation rules
• Integrated: yes, via centralized virtual integration logic
• Time-Variant: what the operational system provides, or by creating a Historical Staging Area in an analytical database
• Non-Volatile: what the operational system provides, or by creating a Historical Staging Area in an analytical database
21.
1. Principle of domain ownership
• Analytical data should be owned by either the source system or its main consumers
2. Data as a product
• Build data artifacts with a true product (management) mindset
3. Self-service data platform
• Use (shared) infrastructure as a platform (in the cloud?) to build this
4. Federated computational governance
• A data governance operating model based on federated decision-making and accountability
Based on these foundational principles…
24.
Feasible
Can the product be made in an acceptable time and for acceptable costs?
Valuable
What are the desires of my customers?
What is my market?
• How to do marketing?
What is the USP?
What price is justified?
Are my customers happy?
Usable
Is the product being used?
Is the product easy (enough) to use?
Are my customers happy?
Some examples of the work you’ll need to do here!
25.
To see ‘data / information as a product’, it practically needs to be:
Discoverable
An easy, Google-like way to find data sets
Addressable
The product needs a permanent unique identifier that stays stable over time
Understandable
The product needs to be accompanied by metadata that describes WHAT something is
Trustworthy and ‘truthful’
The product needs to have a lot of data quality metrics and lineage metadata attached
Natively accessible
Accessible via any interface that suits the consumer, e.g. as an API / via ODBC-SQL / as a stream ‘topic’
Interoperable and composable
The product needs to be accompanied by metadata on HOW it can be combined with other products
Valuable on its own
Usable without the need to first combine it with other data products
Secure
Data security / privacy needs to work on the product without needing ‘something else’
(A sketch of what these principles could look like as product metadata follows below.)
Data Mesh Data Product principles
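As a thought experiment, the principles above could travel as explicit metadata on every data product. A minimal sketch using only the Python standard library; all field names are hypothetical, not taken from the book:

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    # Addressable: a permanent identifier that stays stable over time
    product_id: str
    # Understandable: metadata describing WHAT the data is
    name: str
    description: str
    # Domain ownership: which business domain publishes this product
    owner_domain: str
    # Trustworthy and 'truthful': quality metrics and lineage travel with the product
    quality_metrics: dict[str, float] = field(default_factory=dict)
    lineage: list[str] = field(default_factory=list)
    # Natively accessible: every interface the consumer can use (API / ODBC-SQL / topic)
    access_interfaces: list[str] = field(default_factory=list)
    # Interoperable and composable: HOW to combine it with other products
    join_keys: list[str] = field(default_factory=list)
    # Secure: policies enforced on the product itself, not via 'something else'
    security_policies: list[str] = field(default_factory=list)

# Discoverable: descriptors like this would be pushed to a searchable catalog
customer_product = DataProductDescriptor(
    product_id="dp-customer-001",
    name="Customer profile",
    description="One row per customer, current profile attributes",
    owner_domain="sales",
    quality_metrics={"completeness": 0.98},
    lineage=["crm.customers", "web.signups"],
    access_interfaces=["odbc-sql", "rest-api"],
    join_keys=["customer_id"],
    security_policies=["mask-email-for-non-privileged"],
)
```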
26.
…an extension / repackaging of the existing FAIR data principles
https://www.go-fair.org/fair-principles/
This is not completely new…
How FAIR maps to the Data Mesh product principles:
• Findable → Discoverable; Understandable
• Accessible → Addressable; Natively accessible; Secure
• Interoperable → Trustworthy and ‘truthful’; Interoperable and composable; Valuable on its own
• Reusable → Natively accessible; Trustworthy and ‘truthful’; Understandable
29.
‘Pushing down’ DWH concerns to the operational systems will likely be a long journey
In addition, a lot of the tech mentioned in the book to cover some vital aspects of the data mesh does not exist yet
The alternative I see…
Use the DWH 3.0 / Lakehouse pattern
Make sure to cover the mentioned Data Mesh principles there
• I think the key there is to use Data Vault or another ELM-based data modeling style as an enabler
Overall, this would be my starting point when MVP-ing a ‘meshy architecture’
30.
Subject Oriented
Create domain-specific and centralized hubs
Create domain-specific satellites
Integrated
Domain-specific hubs should (obviously) be integrated within a domain
Centralized hubs should be fed from all domains and / or a central MDM source
Centralized ‘same-as’ links should be created to enable integrating across domains
• Domain-specific satellites can then be shared too
Time-Variant & Non-Volatile
Before the Subject Oriented / Integrated step, data should be loaded RAW into a Historical Staging Area first
Implementing the ‘DWH concerns’ using Data Vault (a minimal sketch follows below)
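A minimal Python sketch of the hub / satellite / ‘same-as’ link mechanics described above. In-memory dicts stand in for database tables, and all names are illustrative; a real load would target your Lakehouse / DWH platform:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    # Data Vault style hash key over the (normalized) business key; MD5 is common
    return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

# In-memory stand-ins for database tables
hub_customer: dict[str, dict] = {}   # Subject Oriented: one row per business key
sat_customer: list[dict] = []        # descriptive attributes, full history
link_same_as: list[dict] = []        # cross-domain 'same-as' integration

def load_hub(business_key: str, record_source: str) -> str:
    hk = hash_key(business_key)
    # Hubs are insert-only: only the first sighting of a key is stored
    hub_customer.setdefault(hk, {
        "hash_key": hk, "business_key": business_key,
        "load_ts": datetime.now(timezone.utc), "record_source": record_source,
    })
    return hk

def load_satellite(hk: str, attributes: dict, record_source: str) -> None:
    # Satellites carry the time-variant context for a hub key
    sat_customer.append({
        "hash_key": hk, "load_ts": datetime.now(timezone.utc),
        "record_source": record_source, **attributes,
    })

def load_same_as_link(hk_a: str, hk_b: str, record_source: str) -> None:
    # A 'same-as' link asserts two business keys denote one real-world entity,
    # enabling integration across domains without touching either domain
    link_same_as.append({
        "hash_key_a": hk_a, "hash_key_b": hk_b,
        "load_ts": datetime.now(timezone.utc), "record_source": record_source,
    })

# Two domains know the same customer under different keys:
hk_sales = load_hub("CUST-42", "sales")
hk_support = load_hub("ACC-9001", "support")
load_satellite(hk_sales, {"email": "a@example.com"}, "sales")
load_same_as_link(hk_sales, hk_support, "mdm")
```

Note how the hash key, not the source entity name, is what stays stable over time; that is what makes a hub a good ‘addressable’ anchor.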
32.
Implementing ‘data as a product’ in the DWH 3.0 / Lakehouse pattern
Product principle → my first implementation ‘idea’ / suggestion:
Discoverable
• Data product metadata should be pushed or pulled from all domains towards a data catalog with a good search interface where NO MANUAL CURATION is needed!
Addressable
• If source entity names change, the product entity names should remain stable. That’s also the purpose of a GOOD Data Vault hub
Trustworthy and ‘truthful’
• Data quality tests should be part of each data product; not having them should block releasing it (see the sketch after this list)
• The data catalog mentioned under Discoverable should handle lineage
Natively accessible
• Next to storing data products ‘in a database’ and using ODBC / JDBC to access them, create a data API on top of each data product, or make sure it can be used via an API too
• Database systems with ‘low friction’ data sharing capabilities could help here
Interoperable and composable
• Embed metadata from parent / child data products in the data product itself
• Again, a data catalog plays a central role here
Valuable on its own
• ELM-based data modeling patterns (like Data Vault) plus a datamart modeling style like Kimball / Dimensional is still the way to go
Secure
• Use the native row- and column-level security features of modern cloud-based analytical databases
• Register these policies as metadata
• This requires data product consumers to consume using named accounts only
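For the ‘Trustworthy’ row above, one way to make missing or failing data quality tests literally block a release. A minimal sketch; the gate function and test names are hypothetical, not a specific product’s API:

```python
from typing import Callable

def release_data_product(name: str,
                         rows: list[dict],
                         quality_tests: list[Callable[[list[dict]], bool]]) -> bool:
    # No tests attached -> no release: quality is not an afterthought
    if not quality_tests:
        print(f"BLOCKED: {name} has no data quality tests attached")
        return False
    failed = [t.__name__ for t in quality_tests if not t(rows)]
    if failed:
        print(f"BLOCKED: {name} failed tests: {failed}")
        return False
    # Here you would also push metadata and lineage to the data catalog
    print(f"RELEASED: {name}")
    return True

def no_null_customer_ids(rows: list[dict]) -> bool:
    return all(r.get("customer_id") is not None for r in rows)

def unique_customer_ids(rows: list[dict]) -> bool:
    ids = [r["customer_id"] for r in rows]
    return len(ids) == len(set(ids))

release_data_product("customer_profile",
                     [{"customer_id": 1}, {"customer_id": 2}],
                     [no_null_customer_ids, unique_customer_ids])
```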
34.
All relevant data mesh info is collected at https://datameshlearning.com/user-stories/
Scott Hirleman drives this initiative
He also hosts the accompanying Data Mesh Radio podcast
• Listen to Shane Gibson’s (Knowledge Gap presenter) episode here: https://daappod.com/data-mesh-radio/repeatable-patterns-and-data-mesh-shane-gibson/
Check out the user journeys here: https://datameshlearning.com/user-stories/
LAST: Where can I find more info?
Our industry is still immature: https://www.linkedin.com/pulse/note-piergiuseppe-bill-inmon/
Summary:
- One of the signs of immaturity of our industry is the practice of depending on vendors to lead the industry.
- Because we are a young and immature industry, there are new advancements that occur every day.
- Because of the newness of our industry there are very few principles. There are new toys. There are new gadgets.
- When a new product or technology comes into the marketplace, the vendor thinks that it is their duty to remove everything that has come before.
- There is a secret to combatting the vendors who are telling you that you are dated and old. The secret is to deliver business value to your end user.
What does this number mean?
Yes, it is the failure rate of BI, analytics and classical data warehousing initiatives
But also of big data, data lake, IoT or AI projects. It is the amount of ML work that never sees the light of day in your production environment
And I don’t make this up; it is being said again and again by the likes of Gartner, Forrester, CIO.com, Cisco
There are 2 reasons
My take on the primary reason WHY this lasting failure is still happening: we are still not addressing the data quality problem structurally
Quality is still an afterthought
It is not only Bill Inmon saying this; the quote above is from a guy who worked at Salesforce, a modern cloud-based SaaS company
In my opinion, solving / addressing / governing these data quality issues is implicitly the core of what data warehousing methodology should address
IMHO: it is ‘just’ a NEW form of Decentralized Data Warehousing
Is data mesh an architecture?
Is it a list of principles?
Is it an operating model?
After all, we rely on the classification of patterns as a major cognitive function to understand the structure of our world.
Hence, I have decided to classify data mesh as a sociotechnical paradigm: an approach that recognizes the interactions between people and the technical architecture and solutions in complex organizations
When people have actually read the data mesh book: even Zhamak herself writes that it is still an emerging concept, and that a lot of the tech needed to build what she describes conceptually DOES NOT EVEN EXIST (YET).
As such, no one can actually claim to ‘sell’ a data mesh or claim to have built a ‘full-fledged’ one.
The only claims that could be true are:
that people are ‘on a journey’ towards creating a data mesh
that vendors sell a tech component that might be applied when designing / building a data mesh.
That is quite a lot of people that will need to be protected, so read up!
Data mesh calls for a fundamental shift in the assumptions, architecture, technical solutions, and social structure of our organizations, in how we manage, use, and own analytical data:
Organizationally, it shifts from centralized ownership of data by specialists who run the data platform technologies to a decentralized data ownership model pushing ownership and accountability of the data back to the business domains where data is produced from or is used.
Architecturally, it shifts from collecting data in monolithic warehouses and lakes to connecting data through a distributed mesh of data products accessed through standardized protocols.
Technologically, it shifts from technology solutions that treat data as a byproduct of running pipeline code to solutions that treat data and code that maintains it as one lively autonomous unit.
Operationally, it shifts data governance from a top-down centralized operational model with human interventions to a federated model with computational policies embedded in the nodes on the mesh.
Principally, it shifts our value system from data as an asset to be collected to data as a product to serve and delight the data users (internal and external to the organization).
Infrastructurally, it shifts from two sets of fragmented and point-to-point integrated infrastructure services—one for data and analytics and the other for applications and operational systems—to a well-integrated set of infrastructure for both operational and data systems.
Data warehousing is an activity, supported by a methodology. It has nothing to do with technology directly; it’s about addressing these data-analytical concerns
These four words say NOTHING about technology. Zip. Nada. They describe what functionally needs to happen.
In traditional DWH modeling approaches you still do this work ‘in one go’
But to be fair, that really is a problem:
What about
modelling time vs added value,
reverse engineering,
starting with a data first / data centric architecture?
Agility
https://www.gartner.com/smarterwithgartner/gartner-top-10-data-and-analytics-trends-for-2021/
https://www.slideshare.net/ParisDataEngineers/delta-lake-oss-create-reliable-and-performant-data-lake-by-quentin-ambard
Data Lakehouse: https://www.snowflake.com/guides/what-data-lakehouse
https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
https://medium.com/snowflake/selling-the-data-lakehouse-a9f25f67c906
Delta Lake: https://docs.databricks.com/delta/index.html
Data Mesh:
https://francois-nguyen.blog/2021/03/07/towards-a-data-mesh-part-1-data-domains-and-teams-topologies/
https://martinfowler.com/articles/data-monolith-to-mesh.html
https://dpgmedia-engineering.medium.com/ddd-data-area-at-dpg-media-f0130e4d9766
If something is feasible, then you can do it without too much difficulty. When someone asks “Is it feasible?” the person is asking if you’ll be able to get something done.
= Capable of being done with means at hand and circumstances as they are.
Synonyms: executable, practicable, viable, workable
All this is work that (NOW) often does not happen, yet makes sense to think about, determine and measure!
The teams have the responsibility to provide data that is easily discoverable, understandable, accessible, and usable, known as data products. There are established roles such as data product owners in each cross-functional domain team that are responsible for data and sharing it successfully
This is missing from the book!
Build your own:
Building a decentralized DWH in a database
Using centralized, configurable cloud-native infra (Snowflake, BigQuery, Databricks)
And therefore I don’t believe in cloud DWH as the answer that suddenly makes data warehousing successful. Who said this?