This document discusses Data Vault fundamentals and best practices. It introduces Data Vault modeling, which involves modeling hubs, links, and satellites to create an enterprise data warehouse that can integrate data sources, provide traceability and history, and adapt incrementally. The document recommends using data virtualization rather than physical data marts to distribute data from the Data Vault. It also provides recommendations for further reading on Data Vault, Ensemble modeling, data virtualization, and certification programs.
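Since the summary above centers on hubs, links, and satellites, a minimal sketch of the three core Data Vault structures may help orient readers. This is a hedged illustration only: the entity names (Customer, Order) and descriptive attributes are assumptions, while the load-date and record-source metadata columns follow common Data Vault convention.

```python
# Minimal sketch of the three core Data Vault structures as plain records.
# Entity names (Customer, Order) and descriptive attributes are illustrative
# assumptions; load_dts and record_source follow common Data Vault convention.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HubCustomer:
    """Hub: one row per unique business key."""
    customer_key: str       # surrogate or hash key
    customer_bk: str        # business key from the source system
    load_dts: datetime      # when this key first arrived in the warehouse
    record_source: str      # which source system supplied it

@dataclass(frozen=True)
class LinkCustomerOrder:
    """Link: one row per unique relationship between hubs."""
    link_key: str           # surrogate or hash key of the relationship
    customer_key: str       # reference to HubCustomer
    order_key: str          # reference to a hypothetical HubOrder
    load_dts: datetime
    record_source: str

@dataclass(frozen=True)
class SatCustomerDetails:
    """Satellite: descriptive attributes, historized by load date."""
    customer_key: str       # reference to HubCustomer
    load_dts: datetime      # every attribute change inserts a new row
    record_source: str
    name: str = ""
    email: str = ""

# Example: a change in a customer's email becomes a new satellite row,
# preserving history without touching the hub.
now = datetime.now()
hub = HubCustomer("h1", "CUST-001", now, "crm")
sat_v1 = SatCustomerDetails("h1", now, "crm", "Ada", "ada@old.example")
```

This separation is what gives the model its incremental adaptability: new attributes extend satellites, and new relationships add links, without re-engineering existing structures.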
Given at Oracle Open World 2011: Not to be confused with Oracle Database Vault (a commercial db security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It has been in use globally for over 10 years now but is not widely known. The purpose of this presentation is to provide an overview of the features of a Data Vault modeled EDW that distinguish it from the more traditional third normal form (3NF) or dimensional (i.e., star schema) modeling approaches used in most shops today. Topics will include dealing with evolving data requirements in an EDW (i.e., model agility), partitioning of data elements based on rate of change (and how that affects load speed and storage requirements), and where it fits in a typical Oracle EDW architecture. See more content like this by following my blog http://kentgraziano.com or follow me on twitter @kentgraziano.
This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. This is focused on the business intelligence side of the house.
If you want to use these slides, please mark them (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
Agile Data Engineering - Intro to Data Vault Modeling (2016) by Kent Graziano
(Updated deck) As more and more teams move to Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring
Data Vault Modeling and Methodology introduction that I provided to a Montreal event in September 2011. It covers an introduction and overview of the Data Vault components for Business Intelligence and Data Warehousing. I am Dan Linstedt, the author and inventor of Data Vault Modeling and methodology.
If you use the images anywhere in your presentations, please credit http://LearnDataVault.com as the source (me).
Thank you kindly,
Daniel Linstedt
Data Lakehouse, Data Mesh, and Data Fabric (r1) by James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling by Kent Graziano
This is a presentation I gave at OUGF14 in Helsinki, Finland.
Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics of how to build and design structures incrementally, without constant refactoring, when using the Data Vault modeling technique. This technique works well for:
• Building the Enterprise Data Warehouse repository in a CIF architecture
• Building a Persistent Staging Area (PSA) in a Kimball Bus Architecture
• Building your data model incrementally, one sprint at a time using a repeatable technique
• Providing a model that is easily extensible without need to re-engineer existing structure or load processes
DAMA, Oregon Chapter, 2012 presentation - an introduction to Data Vault modeling. I cover parts of the methodology and compare and contrast issues in general for the EDW space, followed by a brief technical introduction to the Data Vault modeling method.
After the presentation I will be providing a demonstration of the ETL loading layers, LIVE!
You can find more on-line training at: http://LearnDataVault.com/training
I gave this presentation at the Advanced Architecture Conference, Bill Inmon, 2011 in Evergreen, Colorado. This presentation covers a new breed of data warehousing called Operational Data Warehousing. These are the next steps in business intelligence towards self-service BI and enabling users to do more with their enterprise data warehouse solution. Specifically, it talks about how the Data Vault model fits in to this picture.
If you would like to use the slides, please e-mail me first; I'd be happy to discuss it with you.
Consensus and Raft Algorithm in Distributed System by Thao Huynh Quang
A modern computing system requires multiple components distributed on different machines to provide scalability, high availability, fault tolerance, and low latency. Therefore, consensus is essential when communicating between nodes to agree on some data value required during computation.
There are many examples of consensus around our tooling:
- Google Chubby uses Paxos, which is a consensus algorithm.
- Kubernetes uses etcd as the backing store for all cluster data. Etcd uses Raft, which is a consensus algorithm.
- Hadoop and Kafka use ZooKeeper for service discovery, leader election, and more. ZooKeeper uses a Paxos-variant algorithm.
- Blockchains cannot exist without consensus algorithms such as Proof-of-Work and Proof-of-Stake.
As a result, knowledge of consensus in distributed systems is crucial to understanding the behavior of those systems. This presentation covers:
- A brief introduction to consensus, its challenges, and the results achieved.
- The Raft algorithm, the main consensus algorithm used in most recent systems (a minimal sketch follows the references below).
Some references to get your feet wet on this topic:
https://medium.com/@isuruboyagane.16/what-is-consensus-in-distributed-system-6d51d0802b8c
https://www.youtube.com/watch?v=5m3eBWKjHtM&ab_channel=HasgeekTV
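To make the Raft discussion concrete, here is a minimal, hedged sketch of the RequestVote handling described in the Raft paper. The class shape and field names are illustrative assumptions; networking, persistence, log replication, and election timers are all omitted.

```python
# Minimal sketch of Raft's RequestVote handling on a single node.
# State is simplified to the fields needed for term and vote bookkeeping.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RaftNode:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, term: int, candidate_id: str,
                            cand_last_log_term: int,
                            cand_last_log_index: int) -> Tuple[int, bool]:
        # Reject candidates from stale terms.
        if term < self.current_term:
            return self.current_term, False
        # A newer term resets this node's vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Grant at most one vote per term, and only to a candidate whose
        # log is at least as up to date as this node's own log.
        log_ok = (cand_last_log_term, cand_last_log_index) >= \
                 (self.last_log_term, self.last_log_index)
        if self.voted_for in (None, candidate_id) and log_ok:
            self.voted_for = candidate_id
            return self.current_term, True
        return self.current_term, False

# Example: a fresh follower grants its vote to the first valid candidate.
node = RaftNode()
print(node.handle_request_vote(term=1, candidate_id="n2",
                               cand_last_log_term=0, cand_last_log_index=0))
# -> (1, True)
```

A candidate that collects votes from a majority of nodes becomes leader for that term, which is how Raft ensures at most one leader per term.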
Enabling a Data Mesh Architecture with Data Virtualization by Denodo
Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company that is closely associated with the development of distributed agile methodology. A data mesh is a distributed, de-centralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations leverage data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack of domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slow nature of centralized data infrastructures in provisioning data and responding to changes.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
Presentation on Data Mesh: the paradigm shift is a new type of ecosystem architecture, a shift towards a modern distributed architecture that allows domain-specific data, views "data-as-a-product," and enables each domain to handle its own data pipelines.
Not to be confused with Oracle Database Vault (a commercial db security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the technical components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics for how to build, and design structures when using the Data Vault modeling technique. The target audience is anyone wishing to explore implementing a Data Vault style data model for an Enterprise Data Warehouse, Operational Data Warehouse, or Dynamic Data Integration Store. See more content like this by following my blog http://kentgraziano.com or follow me on twitter @kentgraziano.
Modern Data Warehousing with the Microsoft Analytics Platform System by James Serra
The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle "big data" and provide fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Analytics Platform System (APS) from Microsoft (formerly called Parallel Data Warehouse or PDW), which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and APS. I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.
Modernizing to a Cloud Data Architecture by Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.
Building an Effective Data Warehouse Architecture by James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
A Work of Zhamak Dehghani
Principal consultant
ThoughtWorks
https://martinfowler.com/articles/data-monolith-to-mesh.html
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift from the centralized paradigm of a lake, or its predecessor data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Architect’s Open-Source Guide for a Data Mesh Architecture by Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most important - how can Snowflake help?
Given in Montreal on 14-Dec-2021
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless platforms, and microservices-based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
Data Warehouse Design and Best Practices by Ivo Andreev
A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced, and flexible architecture that is capable of meeting both present and long-term future needs. This session covers a comparison of the main data warehouse architectures together with best practices for the logical and physical design that support staging, load, and querying.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... by DATAVERSITY
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
Building a Modern Analytic Database with Cloudera 5.8 by Cloudera, Inc.
Analytic workloads and the ability to determine “what happened” are some of the most common use cases across enterprises today - helping you understand and adapt based on changing trends. However, most businesses today are only able to see a piece of the story. Analytics are limited by the amount of data able to be stored and ultimately accessed, it’s time-intensive to bring in new datasets or fit unstructured data into rigid schemas, and user access is constrained to a select few who must already know the questions they’re trying to answer.
It’s no surprise that big data is disrupting this modus operandi for analytics. A modern, Hadoop-based platform is designed to help businesses break free of these analytic limitations, providing a new kind of adaptive, high-performance analytic database. The recent release of Cloudera 5.8 continues to advance Cloudera Enterprise as the foundation for these analytic workloads.
Join Justin Erickson, Senior Director of Product Management at Cloudera, and Andy Frey, Chief Technology Officer at Marketing Associates, as they discuss:
-What technology is needed to build a modern analytic database with Hadoop
-What’s new with Cloudera 5.8
-How to align your teams around agile analytics
-Real world success from Marketing Associates
-What’s next for Cloudera Enterprise’s Analytic Database
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture by DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
The Shifting Landscape of Data Integration by DATAVERSITY
Enterprises and organizations from every industry and scale are working to leverage data to achieve their strategic objectives — whether they are to be more profitable, effective, risk-tolerant, prepared, sustainable, and/or adaptable in an ever-changing world. Data has exploded in volume during the last decade as humans and machines alike produce data at an exponential pace. Also, exciting technologies have emerged around that data to improve our abilities and capabilities around what we can do with data.
Behind this data revolution, there are forces at work, causing enterprises to shift the way they leverage data and accelerate the demand for leverageable data. Organizations (and the climates in which they operate) are becoming more and more complex. They are also becoming increasingly digital and, thus, dependent on how data informs, transforms, and automates their operations and decisions. With increased digitization comes an increased need for both scale and agility at scale.
In this session, we have undertaken an ambitious goal of evaluating the current vendor landscape and assessing which platforms have made, or are in the process of making, the leap to this new generation of Data Management and integration capabilities.
This seminar is about data warehousing. We discuss what data warehousing is, the comparison between a database and a data warehouse, different data warehouse models, data marts, and the disadvantages of data warehousing.
Big Data Made Easy: A Simple, Scalable Solution for Getting Started with Hadoop by Precisely
With so many new, evolving frameworks, tools, and languages, a new big data project can lead to confusion and unwarranted risk.
Many organizations have found Data Warehouse Optimization with Hadoop to be a good starting point on their Big Data journey. Offloading ETL workloads from the enterprise data warehouse (EDW) into Hadoop is a well-defined use case that produces tangible results for driving more insights while lowering costs. You gain significant business agility, avoid costly EDW upgrades, and free up EDW capacity for faster queries. This quick win builds credibility and generates savings to reinvest in more Big Data projects.
A proven reference architecture that includes everything you need in a turnkey solution – the Hadoop distribution, data integration software, servers, networking and services – makes it even easier to get started.
Data: it's big, so grab it, store it, analyse it, make it accessible... mine, warehouse, and visualise... use the pictures in your mind and others will see it your way!
Learn about the three advances in database technologies that eliminate the need for star schemas and the resulting maintenance nightmare.
Relational databases in the 1980s were typically designed using the Codd-Date rules for data normalization. It was the most efficient way to store data used in operations. As BI and multi-dimensional analysis became popular, the relational databases began to have performance issues when multiple joins were requested. The development of the star schema was a clever way to get around performance issues and ensure that multi-dimensional queries could be resolved quickly. But this design came with its own set of problems.
Unfortunately, the analytic process is never simple. Business users always think up unimaginable ways to query the data. And the data itself often changes in unpredictable ways. These result in the need for new dimensions, new and mostly redundant star schemas and their indexes, maintenance difficulties in handling slowly changing dimensions, and other problems. The analytical environment becomes overly complex and very difficult to maintain, with long delays in delivering new capabilities, resulting in an unsatisfactory environment for both the users and those maintaining it.
There must be a better way!
Watch this webinar to learn:
- The three technological advances in data storage that eliminate star schemas
- How these innovations benefit analytical environments
- The steps you will need to take to reap the benefits of being star schema-free
Modernising the data warehouse - January 2019 by Phil Watt
I was invited to present on Modernising the Data Warehouse to post-graduate students at the University of Melbourne in January 2019. These slides describe my experience and perspective on this topic that many, if not most, large organisations face. At Escient, we can help organisations navigate this area, and drive better outcomes from data.
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. The data professional is absolutely sitting on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Nov 2014 talk to SW Data Meetup by Mike Olson, co-founder and chairman of Cloudera.
In business, we often deal with hype around trends in society, politics, economy and technology. We know we need to take claims of the next big thing with a grain of salt and that we should be careful not to set expectations too high. However, with Big Data analytics, the opposite is true. The hype that accompanies it actually conceals the enormity of its impact on the way we do business. In this talk I’ll discuss how new 'Data Driven' economies are emerging through relentless innovation across the public and private sectors.
Mike co-founded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of Chief Strategy Officer (CSO). As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment and direct engagement with customers. Prior to Cloudera Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as vice president for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies and Informix Software. Mike has a Bachelor’s and a Master’s Degree in Computer Science from the University of California, Berkeley.
Watch this webinar in full here: https://buff.ly/2MVTKqL
Self-Service BI promises to remove the bottleneck that exists between IT and business users. The truth is, if data is handed over to a wide range of data consumers without proper guardrails in place, it can result in data anarchy.
Attend this session to learn why data virtualization:
• Is a must for implementing the right self-service BI
• Makes self-service BI useful for every business user
• Accelerates any self-service BI initiative
Ethical AI at VDAB, presented by Vincent Buekenhout (Ethical AI Lead, VDAB) a... by Patrick Van Renterghem
Vincent Buekenhout presented the various AI initiatives at VDAB, its AI4Good strategy, the way applications are designed, and most of all, the way ethics, measurement through KPIs, explainability and fairness play a role in this. Vincent also explained how ethics-by-design works at VDAB.
Implementing error-proof, business-critical Machine Learning, presentation by... by Patrick Van Renterghem
This presentation by Deevid De Meyer outlines how Brainjar uses human-centric design and explainability to create machine learning systems that work together with humans to improve efficiency while reducing error rate.
Building Trust and Explainability into Chatbots: the Partena Ziekenfonds Busi... by Patrick Van Renterghem
Chatbots and conversational interfaces are taking over customer service departments by storm. In many companies, they provide first-line support to customers. Based on the Partena Ziekenfonds business case, Karel Kremer shares a few critical success factors...
AI & Ethics: The Belgian Industry Vision & Initiatives, presentation by Jelle... by Patrick Van Renterghem
Jelle Hoedemaekers (Agoria) explains why Belgian companies are working on ethical AI, and provides an overview of Belgian and European AI Initiatives with a focus on ethics.
Responsible AI: An Example AI Development Process with Focus on Risks and Con... by Patrick Van Renterghem
Organisations need to make sure that they use AI in an appropriate way. Martijn and Hugo explain how to ensure that the developments are ethically sound and comply with regulations, how to have end-to-end governance, and how to address bias and fairness, interpretability and explainability, and robustness and security.
During the conference, we looked at an example AI development process, focusing on the risks to be managed and the controls that can be established.
Fairness and Transparency: Algorithmic Explainability, some Legal and Ethical... by Patrick Van Renterghem
In this presentation, Nazanin Gifani discussed some of the ethical and legal issues of automated decision making, including algorithmic fairness, transparency and explainability. The big question here is: can AI help us to make fairer decisions?
How obedient digital twins and intelligent beings contribute to ethics and ex... by Patrick Van Renterghem
Paul Valckenaers explains how intelligence is added to a corresponding reality without introducing limitations into a world-of-interest. The outcome is obedience: a conflict with an obedient digital twin is a conflict with its real-world counterpart. Illustrated by healthcare examples.
He Said, She Said: Finding and Fixing Bias in NLP (Natural Language Processin... by Patrick Van Renterghem
Yves Peirsman presents several instances where bias has posed a risk to the successful adoption of NLP systems, and discusses what techniques exist to discover these biases before the systems are put in production.
Introduction to Bias in Machine Learning, presented by Matthias Feys, CTO @ M... by Patrick Van Renterghem
In this talk, Matthias Feys explains what bias in Machine Learning models actually means. You will get insights in the complexity of the problem and learn realistic ways to reduce bias.
Business Case: Ozitem Groupe, where 80% of the company is working remotely. R... by Patrick Van Renterghem
Roxane Pasina (Ozitem's Chief Marketing and Communication Officer) explains and shows how Ozitem Groupe went in 1 year from an old Intranet to an interactive digital workplace allowing them to overcome their communication challenges, using the Jamespot digital workplace tools
Digital Workplace Case Study: How the Municipality of Duffel successfully swi... by Patrick Van Renterghem
In six months' time, the Gemeente/Municipality of Duffel has come quite close to transforming into a forceful, digital local government thanks to the help of Synergics
Unleashing the Full Potential of People, Teams and SOLVAY, presented by Bruce... by Patrick Van Renterghem
Bruce Fecheyr-Lippens (then SVP, Global Head Agile Working, Digital HR, People Analytics, and HR Director Excellence Center at Solvay) presented the digital workplace environment of Solvay #DWA19 #presentation #digitalworkplace #huapii
The Building Blocks of a Digital Workplace, presented by Sam Marshall at the ... by Patrick Van Renterghem
Sam Marshall, manager of Clearbox Consulting, presented the key building blocks to fulfil the purpose of a digital workplace: to optimise the employee experience #DWA19 #presentation #digitalworkplace #DEX
Engie's Digital Workplace and "Connecting the company" business case, present... by Patrick Van Renterghem
Jan Vanoudendycke (Director of Knowledge Management at Engie) presented the vision, roll-out and adoption process of the massive Engie Digital Workplace effort to connect everyone in the 150,000-person company #DWA19 #presentation #engie
Face your communication challenges when implementing a digital workplace, bas... by Patrick Van Renterghem
Ellen Geens (ChangeLab) described the communication challenges, and gave tips and tricks for the change communication when implementing a digital workplace at their RIZIV and TVH customers
The first steps in Recticel's Digital Workplace program by Kenneth Meuleman (... by Patrick Van Renterghem
Kenneth and Serge presented the first steps in Recticel's digital workplace program, and the managed Microsoft Teams and OneDrive for Business rollout #DWA19 #presentation #recticel
Presentation by Dave Geentjens at the "Successful Digital Workplace Adoption"... by Patrick Van Renterghem
Dave Geentjens described the evolution of the Digital Workplace at the Flemish Government / Vlaamse Overheid: the challenges, the opportunities and the realisations so far #DWA19 #presentation #vlaamseoverheid
The central information provision layer is Argenta's name for its central data hub, based on a near-real-time Data Vault, which answers the information needs of the bank and also feeds applications such as MiFID reporting. This layer is also the base from which data governance is enforced. For this purpose they use Oracle Enterprise Metadata Manager and Collibra.
Presentation by Ivan Schotsmans (DV Community) at the Data Vault Modelling an... by Patrick Van Renterghem
The start of GDPR implementations in Europe was, for most organizations, also the start of rethinking their Data Warehouse strategy. The experience of past implementations gave a better view on the do's and don'ts. One of the important lessons learned was the approach of handling information quality. It's not something you handle on top of your data warehouse. To be successful, information quality goes hand in hand with your data warehouse implementation.
Presentation by Luc Delanglez (DataLumen) at the Data Vault Modelling and Dat... by Patrick Van Renterghem
During this session, Luc Delanglez provides some practical insights to get you started the right way on Traceability, Roles & responsibilities, Data stewardship, Data lineage and impact analysis, Critical functional components, Business Glossary, Data Catalog, Data Quality, Master/Reference Data Management, Compliancy & privacy, all very important aspects of data governance.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
- Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
- Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
- Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
- Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
- AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
- Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
- Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
- Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
- Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
- Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
- Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
- Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
- Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
- Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
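To make the automated data validation point above (item 4) concrete, here is a minimal sketch in Python. The rule names and record fields are invented for the example and are not from any of the presentations on this page:

```python
# Toy automated data-quality check: validate incoming records against
# simple rules before they enter the pipeline, reporting failures per rule.
from datetime import date

RULES = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "amount is non-negative": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    "order_date not in future": lambda r: r.get("order_date", date.min) <= date.today(),
}

def validate(records):
    failures = []
    for i, rec in enumerate(records):
        for rule, check in RULES.items():
            if not check(rec):
                failures.append((i, rule))
    return failures

records = [
    {"customer_id": "C1", "amount": 100.0, "order_date": date(2020, 1, 5)},
    {"customer_id": "",   "amount": -3.0,  "order_date": date(2020, 1, 6)},
]
print(validate(records))  # -> [(1, 'customer_id is present'), (1, 'amount is non-negative')]
```

In practice such checks would run automatically on every load, with failures routed back to the source system rather than patched downstream.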
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be calculated easily; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
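As a hedged illustration of the first technique above (skipping computation on already-converged vertices), here is a minimal Python sketch of power-iteration PageRank. It assumes the graph has no dangling vertices and is not the STICD implementation itself:

```python
# Minimal power-iteration PageRank that skips vertices whose rank has
# already converged (a heuristic: a skipped vertex stays frozen even if
# its in-neighbours later change slightly). Assumes no dangling vertices.
def pagerank_skip(out_edges, d=0.85, tol=1e-9, max_iter=100):
    n = len(out_edges)
    in_edges = [[] for _ in range(n)]        # reverse adjacency list
    for u, vs in enumerate(out_edges):
        for v in vs:
            in_edges[v].append(u)
    out_deg = [len(vs) for vs in out_edges]
    rank = [1.0 / n] * n
    converged = [False] * n
    for _ in range(max_iter):
        new_rank = rank[:]
        active = False
        for v in range(n):
            if converged[v]:
                continue                     # skip settled vertices
            r = (1.0 - d) / n + d * sum(rank[u] / out_deg[u] for u in in_edges[v])
            if abs(r - rank[v]) < tol:
                converged[v] = True
            else:
                active = True
            new_rank[v] = r
        rank = new_rank
        if not active:
            break
    return rank

print([round(r, 4) for r in pagerank_skip([[1, 2], [2], [0]])])  # tiny 3-vertex example
```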
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on graph representations such as Compressed Sparse Row (CSR), an adjacency-list based graph representation that is compact and efficient to traverse.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
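The notes above benchmark C++/OpenMP/CUDA kernels. As a loose Python analog only (it assumes NumPy is installed, and uses float16 to stand in for the float-vs-bfloat16 storage experiment), one can time a naive sequential reduction against a vectorized one:

```python
# Loose analog of the experiments above: naive loop vs vectorized reduction,
# and float32 vs float16 (standing in for bfloat16) as the storage type.
import time
import numpy as np

x32 = np.random.rand(10_000_000).astype(np.float32)
x16 = x32.astype(np.float16)                 # lower-precision storage

t0 = time.perf_counter()
s_loop = 0.0
for v in x32[:100_000]:                      # naive loop on 1% of the data
    s_loop += float(v)
t1 = time.perf_counter()
s_vec = float(x32.sum())                     # vectorized reduction over all data
t2 = time.perf_counter()
s_low = float(x16.sum(dtype=np.float32))     # accumulate in float32 to limit error

print(f"loop (1% of data): {t1 - t0:.4f}s, vectorized (all data): {t2 - t1:.6f}s")
print(f"float32 sum = {s_vec:.2f}, float16-storage sum = {s_low:.2f}")
```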
1. Data Vault Fundamentals & Best Practices
Erik Fransen, managing consultant
+31 6 159 444 76
@erikfransen
2. Agenda
• Introduction
• Data Vault Basics
• Benefits & Challenges
• Best practices: Automation & Data Virtualization
• Recommended reading
3. • Founded in 1998, The Hague, NL
• 40+ consultants
• Business Intelligence, Data Vault, Data Warehousing, Data Warehouse Automation, Big Data, Data Virtualization
• Business & technical consultancy, end-to-end implementation projects of Data Vault EDW, audits, training, certification
• Wide range of customers (profit, non-profit) across various industries
• Since 2009 Genesee Academy partner for Data Vault Day and Data Vault Certification in NL, B & D
• Implementation partner of Cisco, MapR, Qlik & Tableau
4. The Data Vault modeling approach
Data Vault is a data modeling approach
…so it fits into the family of modeling approaches: 3rd Normal Form, Ensemble Modeling, Dimensional
• While 3rd Normal Form is optimal for Operational Systems
…and Dimensional is optimal for Data Marts
…Ensemble Modeling is optimal for the Data Warehouse
• And Data Vault is the leading form of Ensemble Modeling
6. Why do we use Data Vault for DWH?
• When we need a DWH that supports:
– Integration
– Traceability
– History
– Incremental Build
– Agility
• Gracefully Adapts to New Sources
• Full Auditability - Source to Mart
• Enterprise View of Central Data
• Ready for Automation
Data Vault is specifically designed for modelling the EDW
7. The Data Vault Ensemble
• The Data Vault Ensemble conforms to a single key – embodied in the Hub construct
• The parts of the Data Vault Ensemble only include:
– Hubs: the Natural Business Keys
– Links: the Natural Business Relationships
– Satellites: all Context, Descriptive Data and History of Links and Hubs
“Separating things that change from things that don’t change”
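To make the three part types concrete, here is a minimal sketch using Python's sqlite3. The table and column names (h_customer, l_customer_order, s_customer, load_dts, record_source) follow common Data Vault naming conventions but are illustrative, not taken from the slides:

```python
# Toy Data Vault structures: a hub (business key), a link (relationship),
# and a satellite (context + history). Illustrative naming, in-memory SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hub: one row per natural business key
CREATE TABLE h_customer (
    customer_hk   INTEGER PRIMARY KEY,       -- surrogate hub key
    customer_bk   TEXT NOT NULL UNIQUE,      -- natural business key
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: one row per unique natural business relationship
CREATE TABLE l_customer_order (
    link_hk       INTEGER PRIMARY KEY,
    customer_hk   INTEGER NOT NULL REFERENCES h_customer,
    order_hk      INTEGER NOT NULL,          -- would reference h_order in a fuller model
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_hk, order_hk)
);

-- Satellite: all context, descriptive data and history for the hub, keyed by
-- hub key + load timestamp so that changes append rather than overwrite
CREATE TABLE s_customer (
    customer_hk   INTEGER NOT NULL REFERENCES h_customer,
    load_dts      TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
""")
```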
8. The Data Vault modeling approach
• As the scope of the EDW is expanded and new data sources are added, the Data Vault can adapt to these changes without impacting the existing model
• This is what allows the EDW to be built incrementally and to adapt to change without the need for re-engineering
[Diagram: “New Area absorbed” – hubs H_Cust, H_Sale, H_Empl, H_Store and H_Car]
Tools for DWH Automation update the Data Vault EDW (model + data) in a fast, agile & consistent way
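Continuing the toy SQLite sketch above (and echoing the H_Car hub in the slide's diagram), absorbing a new area only adds tables; nothing existing needs to be altered:

```python
# Absorbing a new area: add a new hub and link; existing tables are untouched.
con.executescript("""
CREATE TABLE h_car (
    car_hk        INTEGER PRIMARY KEY,
    car_bk        TEXT NOT NULL UNIQUE,      -- e.g. a VIN as the natural business key
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE l_customer_car (
    link_hk       INTEGER PRIMARY KEY,
    customer_hk   INTEGER NOT NULL REFERENCES h_customer,
    car_hk        INTEGER NOT NULL REFERENCES h_car,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_hk, car_hk)
);
""")
```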
9. Data Vault Benefits
• Business benefits
• Ability to adapt quickly to new business needs
• Data is traceable, allowing for a fully auditable, integrated data store
• Allows the EDW to absorb all data all of the time
• Easily adapts to new data sources and changing business rules – without expensive re-engineering
• Results in a data warehouse with lower total cost of ownership (TCO)
• Automation: short time to market, consistent quality
• Project/development benefits
• Ideal for agile development techniques, resulting in lower project risk and more frequent deliverables
• Can be built incrementally without compromising the core architecture
• Automation: fast and incremental sprints, predictable costs
• Architectural benefits
• Parallel loading
• Data architecture that supports future expanded scope
• Can scale to virtually any size
• Ready for Automation: forces standardization
10. Data Vault Modeling Process
The modeling process for creating a Data Vault model includes three primary steps:
1) Identify and Model the Core Business Concepts
• Business interviews are at the heart of this step:
What do you do? What are the main things you work with?
• Also find the best/target Natural Business Key
2) Identify and Model the Natural Business Relationships
• Specific Unique Relationships
3) Analyze and Design the Context Satellites
• Consider rate of change, type of data and also the sources of your data during the design process
Ideally the Data Vault is modelled based on business processes and business concepts
11. Getting data out of the Data Vault
• Problem:
– The Data Vault EDW is about data decomposition, data registration and data integration
– Data Vault is not intended, nor designed or optimized, for data distribution and data consumption downstream of the EDW
– This typically leads to many complex physical data marts (high maintenance, high cost)
• Solution:
– Start thinking differently: focus on creating functional data products for the business
– Stop loading and replicating data physically; start using data virtualization
12. Eliminate the need for physical data marts
• No data replication needed
• Real-time data refreshment
• No redundant data storage
• Simple updates of data models
• Simple queries
• Short time to market
• Automatic updates
• Lower storage costs
• High performance
• Ready for Big Data
[Diagram: source systems (CRM, ERP, weblogs, …) feed production data into the Data Vault EDW (one data copy); a data virtualization tool plus data abstraction layers serve steering information via SQL, with no data copy at all beyond the EDW]
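As a toy illustration of the virtual approach, the same in-memory SQLite model from the earlier sketches can expose a "virtual data mart" as a view over the hub and its most recent satellite rows; nothing is materialized, the join runs at query time. The view name and query are illustrative only:

```python
# Virtual "customer dimension": a view over hub + latest satellite row,
# queried in place instead of being loaded into a physical data mart.
con.executescript("""
CREATE VIEW dim_customer AS
SELECT h.customer_bk,
       s.name,
       s.city
FROM   h_customer h
JOIN   s_customer s ON s.customer_hk = h.customer_hk
WHERE  s.load_dts = (SELECT MAX(s2.load_dts)
                     FROM   s_customer s2
                     WHERE  s2.customer_hk = s.customer_hk);
""")
for row in con.execute("SELECT * FROM dim_customer"):
    print(row)
```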
14. Wrap up
• Data Vault Basics:
– Hubs, Links, Satellites
– Integration, history, incremental modelling, agility
• Benefits:
– Business, project, architecture
– Make use of automation tools for fast, agile and consistent delivery
• Challenges:
– Getting data downstream of the Data Vault EDW
– Solution: use virtual data marts and automate SuperNova data models for reporting & analytics
18. Recommended reading on Data Virtualization
Data Virtualization in Business Intelligence Architectures
• The first independent book on data virtualization that explains, in a product-independent way, how data virtualization technology works
• Illustrates concepts using examples developed with commercially available products
• Shows you how to solve common data integration challenges such as data quality, system interference, and overall performance by following practical guidelines on using data virtualization
• Apply data virtualization right away with three chapters full of practical implementation guidance
• Understand the big picture of data virtualization and its relationship with data governance and information management
19. Data Vault Training & Certification
• CDVDM: March 31, April 1 2016 Amsterdam
• DVD: March 2, 2016 Diegem
• www.centennium-opleidingen.nl
• For all questions: opleidingen@centennium.nl
20. A short history of Data Vault
• 2002: First papers published by Dan Linstedt
• 2006: Start of the CDVDM certification program by Genesee Academy
• 2007: Start of Data Vault EDW implementations
– Primarily in Europe (NL, S), some in the USA
• 2008-2015: Several books published on Data Vault by Dan Linstedt, Hans Hultgren and others
• 2013: Data Vault on the radar in B, DACH, UK, USA, AUS, NZ, Asia
• 2013: Data Vault EDW implementations going worldwide
• 2015: Over 900 CDVDM professionals and 750+ Data Vault EDWs worldwide