Not to be confused with Oracle Database Vault (a commercial database security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the technical components of the Data Vault Data Model, what they are for, and how to build them. The examples will give attendees the basics of how to build and design structures when using the Data Vault modeling technique. The target audience is anyone wishing to explore implementing a Data Vault style data model for an Enterprise Data Warehouse, Operational Data Warehouse, or Dynamic Data Integration Store. See more content like this by following my blog http://kentgraziano.com or follow me on Twitter @kentgraziano.
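As a hedged illustration of those components (the table and column names here are mine for illustration, not from the presentation), the three core Data Vault structures look roughly like this in generic SQL:

```sql
-- Hub: the unique list of business keys for one core business concept
CREATE TABLE hub_customer (
    customer_hkey   CHAR(32)     NOT NULL PRIMARY KEY,  -- surrogate/hash key
    customer_bk     VARCHAR(50)  NOT NULL UNIQUE,       -- business key
    load_date       TIMESTAMP    NOT NULL,
    record_source   VARCHAR(50)  NOT NULL
);

-- Link: a relationship (transaction, association) between hubs
CREATE TABLE link_customer_order (
    customer_order_hkey CHAR(32)    NOT NULL PRIMARY KEY,
    customer_hkey       CHAR(32)    NOT NULL REFERENCES hub_customer,
    order_hkey          CHAR(32)    NOT NULL,  -- would reference hub_order
    load_date           TIMESTAMP   NOT NULL,
    record_source       VARCHAR(50) NOT NULL
);

-- Satellite: descriptive attributes, with history kept by load date
CREATE TABLE sat_customer_details (
    customer_hkey   CHAR(32)     NOT NULL REFERENCES hub_customer,
    load_date       TIMESTAMP    NOT NULL,
    customer_name   VARCHAR(100),
    customer_email  VARCHAR(100),
    record_source   VARCHAR(50)  NOT NULL,
    PRIMARY KEY (customer_hkey, load_date)
);
```

The pattern is always the same: business keys go in hubs, relationships in links, and all descriptive, history-tracked attributes in satellites.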
Modernizing to a Cloud Data Architecture - Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices drawn from their successful migration of data and workloads to the cloud.
This is the Data Vault Modeling and Methodology introduction that I presented at a Montreal event in September 2011. It provides an introduction to and overview of the Data Vault components for Business Intelligence and Data Warehousing. I am Dan Linstedt, the author and inventor of Data Vault Modeling and methodology.
If you use the images anywhere in your presentations, please credit http://LearnDataVault.com as the source (me).
Thank you kindly,
Daniel Linstedt
This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. This is focused on the business intelligence side of the house.
If you want to use these slides, please include "(C) Dan Linstedt, all rights reserved, http://LearnDataVault.com".
Building Lakehouses on Delta Lake with SQL Analytics Primer - Databricks
You’ve heard the marketing buzz, and maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together. Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
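As a hedged sketch of how those pieces can fit together (Delta Lake's SQL dialect; the table and column names are invented for illustration):

```sql
-- Land raw events in a Delta table (a typical Bronze layer)
CREATE TABLE IF NOT EXISTS bronze_events (
    event_id   STRING,
    event_type STRING,
    event_ts   TIMESTAMP,
    payload    STRING
) USING DELTA;

-- Curate a cleaner Silver table for analysts
CREATE TABLE IF NOT EXISTS silver_events USING DELTA AS
SELECT event_id, event_type, event_ts
FROM bronze_events
WHERE event_id IS NOT NULL;

-- Exploratory analysis from SQL Analytics (or any BI tool over JDBC/ODBC)
SELECT event_type, COUNT(*) AS events_per_type
FROM silver_events
GROUP BY event_type
ORDER BY events_per_type DESC;
```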
Making Data Timelier and More Reliable with Lakehouse Technology - Matei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Big data architectures and the data lake - James Serra
With so many new technologies, it can be confusing to choose the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs. bottom-up approach to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap with others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Enabling a Data Mesh Architecture with Data Virtualization - Denodo
Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company that is closely associated with the development of distributed agile methodology. A data mesh is a distributed, de-centralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations leverage data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack of domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slowness of centralized data infrastructures in provisioning data and responding to changes.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
Data Warehouse Design and Best Practices - Ivo Andreev
A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced, and flexible architecture that is capable of meeting both present and long-term future needs. This session covers a comparison of the main data warehouse architectures, together with best practices for the logical and physical design that support staging, loading, and querying.
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. He was the 4th employee and VP of Product at MapR, a pioneer in Big Data analytics. He has also held numerous product management and engineering roles at IBM Research and Microsoft, and founded several websites that served millions of users. He holds a Master's in Computer Engineering from Carnegie Mellon University and a Bachelor of Science in Computer Science from the Technion - Israel Institute of Technology.
The Modern Data Stack meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an "open" data platform built on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs such platforms incur. He also champions features that are transforming data management, through initiatives such as Nessie, which opens the road to Data as Code and multi-process transactions.
The Modern Data Stack Meetup gives Tomer Shiran "carte blanche" to share his experience and his vision of the Open Data Lakehouse.
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy... - Databricks
A traditional data team has roles including data engineer, data scientist, and data analyst. However, many organizations are finding success by integrating a new role – the analytics engineer. The analytics engineer develops a code-based data infrastructure that can serve both analytics and data science teams. He or she develops re-usable data models using the software engineering practices of version control and unit testing, and provides the critical domain expertise that ensures that data products are relevant and insightful. In this talk we’ll discuss the role and skill set of the analytics engineer, and how dbt, an open source programming environment, empowers anyone with a SQL skillset to fulfill this new role on the data team. We’ll demonstrate how to use dbt to build version-controlled data models on top of Delta Lake, test both the code and our assumptions about the underlying data, and orchestrate complete data pipelines on Apache Spark™.
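As a hedged sketch of the kind of artifact this role produces (the model and column names are invented; dbt models are plain SELECT statements, and {{ ref() }} is how dbt wires models together into a dependency graph):

```sql
-- models/fct_daily_orders.sql : a version-controlled dbt model.
-- dbt resolves ref('stg_orders') to the concrete table on Delta Lake and
-- uses the reference to materialize models in dependency order.
SELECT
    order_date,
    COUNT(*)         AS order_count,
    SUM(order_total) AS revenue
FROM {{ ref('stg_orders') }}
WHERE order_status = 'completed'
GROUP BY order_date
```

Because the model is just text, it can be version controlled, code reviewed, and covered by dbt's schema tests like any other software artifact.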
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling - Kent Graziano
This is a presentation I gave at OUGF14 in Helsinki, Finland.
Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the components of the Data Vault Data Model, what they are for, and how to build them. The examples will give attendees the basics of how to build and design structures incrementally, without constant refactoring, when using the Data Vault modeling technique. This technique works well for:
• Building the Enterprise Data Warehouse repository in a CIF architecture
• Building a Persistent Staging Area (PSA) in a Kimball Bus Architecture
• Building your data model incrementally, one sprint at a time using a repeatable technique
• Providing a model that is easily extensible without need to re-engineer existing structure or load processes
Presentation on Data Mesh: the paradigm shift is a new type of ecosystem architecture, a shift left towards a modern distributed architecture that allows domain-specific ownership of data, views "data-as-a-product," and enables each domain to handle its own data pipelines.
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga... - DataScienceConferenc1
Dragan Berić will take a deep dive into Lakehouse architecture, a game-changing concept bridging the best elements of data lake and data warehouse. The presentation will focus on the Delta Lake format as the foundation of the Lakehouse philosophy, and Databricks as the primary platform for its implementation.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse (see the sketch below)
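On the late-arriving-data point above, a hedged sketch of the idiom (invented table names; MERGE INTO with UPDATE SET * / INSERT * is Delta Lake SQL):

```sql
-- Upsert late-arriving records directly into the lakehouse table:
-- update rows that already exist, insert the ones that don't.
MERGE INTO silver_orders AS t
USING late_arriving_orders AS s
    ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```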
I gave this presentation at Bill Inmon's Advanced Architecture Conference in 2011 in Evergreen, Colorado. It covers a new breed of data warehousing called Operational Data Warehousing: the next step in business intelligence towards self-service BI, enabling users to do more with their enterprise data warehouse solution. Specifically, it talks about how the Data Vault model fits into this picture.
If you would like to use the slides, please e-mail me first; I'd be happy to discuss it with you.
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
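To make two of those declared features concrete, here is a hedged sketch in Delta Lake's SQL dialect (Hudi and Iceberg expose equivalents with their own syntax; the table name is invented):

```sql
-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE lake_events ADD COLUMNS (country STRING);

-- Time travel: query the table as it existed at an earlier version
SELECT COUNT(*) FROM lake_events VERSION AS OF 42;
```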
Agile & Data Modeling – How Can They Work Together? - DATAVERSITY
A tenet of the Agile Manifesto is ‘Working software over comprehensive documentation’, and many have interpreted that to mean that data models are not necessary in the agile development environment. Others have seen the value of data models for achieving the other core tenets of ‘Customer Collaboration’ and ‘Responding to Change’.
This webinar will discuss how data models are being effectively used in today’s Agile development environment and the benefits that are being achieved from this approach.
Databricks: A Tool That Empowers You To Do More With Data - Databricks
In this talk we will present how Databricks has enabled the author to achieve more with data, enabling one person to build a coherent data project with data engineering, analysis, and science components, with better collaboration, better productionization methods, larger datasets, and faster turnaround.
The talk will include a demo that will illustrate how the multiple functionalities of Databricks help to build a coherent data project with Databricks jobs, Delta Lake and auto-loader for data engineering, SQL Analytics for Data Analysis, Spark ML and MLFlow for data science, and Projects for collaboration.
Given at Oracle Open World 2011: Not to be confused with Oracle Database Vault (a commercial database security product), Data Vault Modeling is a specific data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It has been in use globally for over 10 years now but is not widely known. The purpose of this presentation is to provide an overview of the features of a Data Vault modeled EDW that distinguish it from the more traditional third normal form (3NF) or dimensional (i.e., star schema) modeling approaches used in most shops today. Topics will include dealing with evolving data requirements in an EDW (i.e., model agility), partitioning of data elements based on rate of change (and how that affects load speed and storage requirements), and where it fits in a typical Oracle EDW architecture. See more content like this by following my blog http://kentgraziano.com or follow me on Twitter @kentgraziano.
DAMA, Oregon Chapter, 2012 presentation - an introduction to Data Vault modeling. I will be covering parts of the methodology, comparing and contrasting issues in the EDW space in general, followed by a brief technical introduction to the Data Vault modeling method.
After the presentation I will be providing a demonstration of the ETL loading layers, LIVE!
You can find more on-line training at: http://LearnDataVault.com/training
Building an Effective Data Warehouse Architecture - James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
Data Warehousing Trends, Best Practices, and Future Outlook - James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure, or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use cases and discussion of commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Agile Data Rationalization for Operational Intelligence - Inside Analysis
The Briefing Room with Eric Kavanagh and Phasic Systems
Live Webcast Mar. 26, 2013
The complexity of today's information architectures creates a wide range of challenges for executives trying to get a strategic view of their current operations. The data and context locked in operational systems often get diluted during the normalization processes of data warehousing and other types of analytic solutions. And the ultimate goal of seeing the big picture gets derailed by a basic inability to reconcile disparate organizational views of key information assets and rules.
Register for this episode of The Briefing Room to learn from Bloor Group CEO Eric Kavanagh, who will explain how a tightly controlled methodology can be combined with modern NoSQL technology to resolve both process and system complexities, thus enabling a much richer, more interconnected information landscape. Kavanagh will be briefed by Geoffrey Malafsky of Phasic Systems who will share his company's tested methodology for capturing and managing the business and process logic that run today's data-driven organizations. He'll demonstrate how a “don't say no” approach to entity definitions can dissolve previously intractable disagreements, opening the door to clear, verifiable operational intelligence.
Visit: http://www.insideanalysis.com
Agile Data Engineering: Introduction to Data Vault 2.0 (2018) - Kent Graziano
(updated slides used for North Texas DAMA meetup Oct 2018) As we move more and more towards the need for everyone to do Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 15 years and is now growing in popularity. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring (see the sketch below)
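One concrete detail worth sketching: Data Vault 2.0 replaces sequence-generated hub keys with deterministic hashes of the business key, so hubs, links, and satellites can be loaded in parallel without key lookups. A minimal, hedged sketch (table and column names invented; exact hashing rules vary by implementation):

```sql
-- DV 2.0 style hub load: the hash key is derived from the business key,
-- so any process can compute it independently of any other load.
INSERT INTO hub_customer (customer_hkey, customer_bk, load_date, record_source)
SELECT DISTINCT
    MD5(UPPER(TRIM(src.customer_bk))),   -- deterministic hash key
    src.customer_bk,
    CURRENT_TIMESTAMP,
    'CRM'
FROM stg_customers src
WHERE NOT EXISTS (
    SELECT 1 FROM hub_customer h
    WHERE h.customer_bk = src.customer_bk
);
```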
Part 2 of a 2-part presentation that I did in 2009; this presentation covers more about unstructured data and operational data vault components. YES, even then I was commenting on how this market would evolve. If you want to use these slides, please let me know, and add "(C) Dan Linstedt, all rights reserved, http://LearnDataVault.com" in a VISIBLE fashion on your slides.
Data: it's big, so grab it, store it, analyse it, make it accessible... mine, warehouse, and visualise... use the pictures in your mind and others will see it your way!
All Grown Up: Maturation of Analytics in the Cloud - Inside Analysis
The Briefing Room with Wayne Eckerson and Birst
Live Webcast on Nov. 6, 2012
The desire for analytics today extends far beyond the traditional domain of Business Intelligence. The challenge is that operational systems come in countless shapes and sizes. Furthermore, each application treats data somewhat differently. But there are patterns of data flow and transformation that pervade all such systems. And there's one big place where all these data types and use cases have come together architecturally: the Cloud.
Watch this episode of the Briefing Room to hear veteran Analyst Wayne Eckerson explain how Cloud computing is ushering in a new era of analytics and intelligence. He'll be briefed by Brad Peters of Birst who will tout his company's purpose-built analytics platform. He'll discuss how the Birst engine processes and delivers raw data from disparate systems, offering the deployment flexibility of Software-as-a-Service, together with the capabilities of enterprise-class BI.
Businesses cannot compete without data. Every organization produces and consumes it. Data trends are hitting the mainstream, and businesses are adopting buzzwords such as Big Data, data vault, data scientist, etc., to seek solutions for their fundamental data issues. Few realize that the success of any solution, regardless of platform or technology, depends on the data model supporting it. Data modeling is not an optional task for an organization’s data remediation effort. Instead, it is a vital activity that supports the solution driving your business.
This webinar will address emerging trends around data model application methodology, as well as trends around the practice of data modeling itself. We will discuss abstract models and entity frameworks, as well as the general shift from data modeling being segmented to becoming more integrated with business practices.
Takeaways:
How are anchor modeling, data vault, etc. different and when should I apply them?
Integrating data models to business models and the value this creates
Application development (Data first, code first, object first)
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio... - Cambridge Semantics
Only with a rich and interactive semantic layer can your data and analytics stack deliver true on-demand access to data, answers and insights - weaving data together from across the enterprise into an information fabric. In this webinar we introduce Anzo Smart Data Lake 4.0, which provides that rich and interactive semantic layer to your data.
From Business Intelligence to Big Data - hack/reduce Dec 2014 - Adam Ferrari
Talk given on Dec. 3, 2014 at MIT, sponsored by Hack/Reduce. This talk looks at the history of Business Intelligence, from first-generation OLAP tools through modern Data Discovery and visualization tools. Looking forward, what can we learn from that evolution as numerous new tools and architectures for analytics emerge in the Big Data era?
Balance agility and governance with #TrueDataOps and The Data Cloud - Kent Graziano
DataOps is the application of DevOps concepts to data. The DataOps Manifesto outlines WHAT that means, similar to how the Agile Manifesto outlines the goals of the Agile Software movement. But, as the demand for data governance has increased, and the demand to do “more with less” and be more agile has put more pressure on data teams, we all need more guidance on HOW to manage all this. Seeing that need, a small group of industry thought leaders and practitioners got together and created the #TrueDataOps philosophy to describe the best way to deliver DataOps by defining the core pillars that must underpin a successful approach. Combining this approach with an agile and governed platform like Snowflake’s Data Cloud allows organizations to indeed balance these seemingly competing goals while still delivering value at scale.
Given in Montreal on 14-Dec-2021
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most important - how can Snowflake help?
Given in Montreal on 14-Dec-2021
HOW TO SAVE PILES OF $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc... - Kent Graziano
A good data model, done right the first time, can save you time and money. We have all seen the charts on the increasing cost of finding a mistake/bug/error late in a software development cycle. Would you like to reduce, or even eliminate, your risk of finding one of those errors late in the game? Of course you would! Who wouldn't? Nobody plans to miss a requirement or make a bad design decision (well nobody sane anyway). No data modeler or database designer worth their salt wants to leave a model incomplete or incorrect. So what can you do to minimize the risk?
In this talk I will show you a best practice approach to developing your data models and database designs that I have been using for over 15 years. It is a simple, repeatable process for reviewing your data models. It is one that even a non-modeler could follow. I will share my checklist of what to look for and what to ask the data modeler (or yourself) to make sure you get the best possible data model. As a bonus I will share how I use SQL Developer Data Modeler (a no-cost data modeling tool) to collect the information and report it.
This talk will introduce you to the Data Cloud, how it works, and the problems it solves for companies across the globe and across industries. The Data Cloud is a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the Data Cloud.
Delivering Data Democratization in the Cloud with Snowflake - Kent Graziano
This is a brief introduction to Snowflake Cloud Data Platform and our revolutionary architecture. It contains a discussion of some of our unique features along with some real world metrics from our global customer base.
Demystifying Data Warehousing as a Service (GLOC 2019) - Kent Graziano
Extended deck from the 2019 GLOC event in Cleveland. Discusses what a DWaaS is, the top 10 features of Snowflake that represent that, and a check list for what questions to ask when choosing a cloud based data warehouse.
[Given at DAMA WI, Nov 2018] With the increasing prevalence of semi-structured data from IoT devices, web logs, and other sources, data architects and modelers have to learn how to interpret and project data from things like JSON. While the concept of loading data without upfront modeling is appealing to many, ultimately, in order to make sense of the data and use it to drive business value, we have to turn that schema-on-read data into a real schema! That means data modeling! In this session I will walk through both simple and complex JSON documents, decompose them, then turn them into a representative data model using Oracle SQL Developer Data Modeler. I will show you how they might look using both traditional 3NF and data vault styles of modeling. In this session you will:
1. See what a JSON document looks like
2. Understand how to read it
3. Learn how to convert it to a standard data model (see the sketch below)
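As a hedged sketch of step 3 (the JSON document and all names below are invented, not from the session), the repeating array in a document typically becomes a child table in a traditional 3NF model:

```sql
-- Source document, shown as a comment:
-- { "customer_id": "C042",
--   "name": "Acme Corp",
--   "orders": [ { "order_id": 7, "total": 99.50 },
--               { "order_id": 9, "total": 12.00 } ] }

CREATE TABLE customer (
    customer_id VARCHAR(10)  PRIMARY KEY,
    name        VARCHAR(100)
);

-- The nested "orders" array normalizes out to a child table
CREATE TABLE customer_order (
    order_id    INTEGER       PRIMARY KEY,
    customer_id VARCHAR(10)   NOT NULL REFERENCES customer,
    total       DECIMAL(10,2)
);
```

A data vault rendering would go one step further, splitting the business keys into hubs and the descriptive fields into satellites.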
Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions - Kent Graziano
From a talk I gave at WWDVC and ECO in 2015 about how we built virtual dimensions (views) on a data vault-style data warehouse (see "Data Warehousing in the Real World" for full details on that architecture).
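The gist of the approach, as a hedged sketch with invented names (not the talk's actual code): each Type 2 row comes straight from satellite history, while a second join to the latest satellite row supplies the Type 1 "current value" columns:

```sql
CREATE VIEW dim_customer AS
SELECT
    h.customer_hkey     AS dim_customer_key,
    h.customer_bk       AS customer_number,
    s.customer_name     AS customer_name_t2,   -- Type 2: value as of this row
    cur.customer_name   AS customer_name_t1,   -- Type 1: always current value
    s.load_date         AS effective_date
FROM hub_customer h
JOIN sat_customer_details s
    ON s.customer_hkey = h.customer_hkey
JOIN sat_customer_details cur
    ON cur.customer_hkey = h.customer_hkey
   AND cur.load_date = (SELECT MAX(m.load_date)
                        FROM sat_customer_details m
                        WHERE m.customer_hkey = h.customer_hkey);
```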
Demystifying Data Warehouse as a Service (DWaaS) - Kent Graziano
This is from the talk I gave at the 30th Anniversary NoCOUG meeting in San Jose, CA.
We all know that data warehouses and best practices for them are changing dramatically today. As organizations build new data warehouses and modernize established ones, they are turning to Data Warehousing as a Service (DWaaS) in hopes of taking advantage of the performance, concurrency, simplicity, and lower cost of a SaaS solution or simply to reduce their data center footprint (and the maintenance that goes with that).
But what is a DWaaS really? How is it different from traditional on-premises data warehousing?
In this talk I will:
• Demystify DWaaS by defining it and its goals
• Discuss the real-world benefits of DWaaS
• Discuss some of the coolest features in a DWaaS solution as exemplified by the Snowflake Elastic Data Warehouse.
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS - Kent Graziano
(This is the talk I gave at Houston DAMA and Agile Denver BI meetups)
At a past client, in order to meet timelines to fulfill urgent, unmet reporting needs, I found it necessary to build a virtualized Operational Data Store as the first phase of a new Data Vault 2.0 project. This allowed me to deliver new objects quickly and incrementally to the report developer, so we could quickly show the business users their data. In order to limit the need for refactoring in later stages of the data warehouse development, I chose to build this virtualization layer on top of a Type 2 persistent staging layer. All of this was done using Oracle SQL Developer Data Modeler (SDDM) against (gasp!) a MS SQL Server database. In this talk I will show you the architecture for this approach and the rationale, and then the tricks I used in SDDM to build all the stage tables and views very quickly. In the end you will see actual SQL code for a virtual ODS that can easily be translated to an Oracle database.
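The core pattern, as a hedged sketch with invented names: the persistent staging table keeps every version of every source row, and the ODS "table" is just a view that selects the latest version, so report developers get current data with nothing to refactor later:

```sql
-- psa_customer is a Type 2 persistent staging table: every load appends
-- new versions of changed source rows, keyed by customer_id + load_date.
CREATE VIEW ods_customer AS
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY s.customer_id
                              ORDER BY s.load_date DESC) AS rn
    FROM psa_customer s
) AS v
WHERE v.rn = 1;
```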
Agile Data Engineering - Intro to Data Vault Modeling (2016) - Kent Graziano
(Updated deck) As we move more and more towards the need for everyone to do Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring
Agile Methods and Data Warehousing (2016 update) - Kent Graziano
This presentation takes a look at the Agile Manifesto and the 12 Principles of Agile Development and discusses how these apply to Data Warehousing and Business Intelligence projects. Several examples and details from my past experience are included. Includes more details on using Data Vault as well. (I gave this presentation at OUGF14 in Helsinki, Finland and again in 2016 for TDWI Nashville.)
These are the slides from my talk at Data Day Texas 2016 (#ddtx16).
The world of data warehousing has changed! With the advent of Big Data, Streaming Data, IoT, and The Cloud, what is a modern data management professional to do? It may seem to be a very different world with different concepts, terms, and techniques. Or is it? Lots of people still talk about having a data warehouse or several data marts across their organization. But what does that really mean today in 2016? How about the Corporate Information Factory (CIF), the Data Vault, an Operational Data Store (ODS), or just star schemas? Where do they fit now (or do they)? And now we have the Extended Data Warehouse (XDW) as well. How do all these things help us bring value and data-based decisions to our organizations? Where do Big Data and the Cloud fit? Is there a coherent architecture we can define? This talk will endeavor to cut through the hype and the buzzword bingo to help you figure out what part of this is helpful. I will discuss what I have seen in the real world (working and not working!) and a bit of where I think we are going and need to go in 2016 and beyond.
Worst Practices in Data Warehouse Design - Kent Graziano
This presentation was given at OakTable World 2014 (#OTW14) in San Francisco. After many years of designing data warehouses and consulting on data warehouse architectures, I have seen a lot of bad design choices by supposedly experienced professionals. A sense of professionalism, confidentiality agreements, and some sense of common decency have prevented me from calling people out on some of this. No more! In this session I will walk you through a typical bad design like many I have seen. I will show you what I see when I reverse engineer a supposedly complete design, walk through what is wrong with it, and discuss options to correct it. This will be a test of your knowledge of data warehouse best practices by seeing if you can recognize these worst practices.
Data Vault 2.0: Using MD5 Hashes for Change Data Capture - Kent Graziano
This presentation was given at OakTable World 2014 (#OTW14) in San Francisco as a short TED-style 10-minute talk. In it I introduce Data Vault 2.0 and its innovative approach to doing change data capture in a data warehouse by using MD5 hash columns.
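The essence of the technique, as a hedged sketch with invented column names (concatenation and hash syntax vary by database; '||' and MD5() are shown here):

```sql
-- Compute one hash over all descriptive columns of an incoming row.
-- COALESCE and a delimiter guard against NULLs and shifted values.
SELECT
    src.customer_bk,
    MD5(COALESCE(src.customer_name,  '') || '|' ||
        COALESCE(src.customer_email, '')) AS hash_diff
FROM stg_customers src;
```

A new satellite row is then loaded only when this hash differs from the most recently stored hash for the same key, so change detection compares one column instead of every attribute.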
I gave this presentation at OUGF14 in Helsinki, Finland and again for TDWI Nashville. This presentation takes a look at the Agile Manifesto and the 12 Principles of Agile Development and discusses how these apply to Data Warehousing and Business Intelligence projects. Several examples and details from my past experience are included.
Top Five Cool Features in Oracle SQL Developer Data Modeler - Kent Graziano
This is the presentation I gave at OUGF14 in Helsinki, Finland in June 2014.
Oracle SQL Developer Data Modeler (SDDM) has been around for a few years now and is up to version 4.x. It really is an industrial-strength data modeling tool that can be used for any data modeling task you need to tackle. Over the years I have found quite a few features and utilities in the tool that I rely on to make me more efficient (and agile) in developing my models. This presentation will demonstrate at least five of these features, tips, and tricks for you. I will walk through things like modifying the delivered reporting templates, how to create and apply object naming templates, how to use a table template and transformation script to add audit columns to every table, and how to use the new metadata export tool, plus several other cool things you might not know are there. Since there will likely be patches and new releases before the conference, there is a good chance there will be some new things for me to show you as well. This might be a bit of a whirlwind demo, so get SDDM installed on your device and bring it to the session so you can follow along.
(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling - Kent Graziano
This is the presentation I gave at OakTable World 2013 in San Francisco. #OTW13 was held at the Children's Creativity Museum next to the Moscone Convention Center and was in parallel with Oracle OpenWorld 2013.
The session discussed our attempts to be more agile in designing enterprise data warehouses and how the Data Vault Data Modeling technique helps in that approach.
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach - Kent Graziano
First we interview the users, then we design a reporting model based on those interviews. We follow that up with mounds of ETL development to load the new model, basically keeping the user community in the dark during all that development. Does this sound familiar?
This presentation will demonstrate an alternative approach using the Data Vault Data Modeling technique to build a flexible, easily-extensible “Foundation” layer in our data warehouse with an Agile, iterative methodology. Relying on the Business Model and Mapping (BMM) functionality of OBIEE, we can rapidly virtualize a dimensional reporting model using the pattern-based Data Vault Foundation layer to decrease the time, and money, it takes to get BI content in front of end users. Attendees will see a sample Data Vault model designed iteratively and deployed to the semantic model of OBIEE.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We finished with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Search and Society: Reimagining Information Access for Radical Futures - Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I asked myself, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure and operations point of view? Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and offer a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss which cloud/on-premises strategy we may need to apply to our own infrastructure to make it work from an enterprise perspective. I give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo provides some insights into the approaches I already have working for real.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell us all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details of how best to design a sturdy architecture within ODC.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Introduction to Data Vault Modeling
1. Introduction to Data Vault Modeling
Kent Graziano
Data Vault Master and Oracle ACE
TrueBridge Resources
OOW 2011
Session #05923
2. My Bio
• Kent Graziano
– Certified Data Vault Master
– Oracle ACE (BI/DW)
– Data Architecture and Data Warehouse Specialist
• 30 years in IT
• 20 years of Oracle-related work
• 15+ years of data warehousing experience
– Co-Author of
• The Business of Data Vault Modeling (2008)
• The Data Model Resource Book (1st Edition)
• Oracle Designer: A Template for Developing an Enterprise Standards Document
– Past-President of Oracle Development Tools User Group (ODTUG) and Rocky Mountain Oracle User Group
– Co-Chair BIDW SIG for ODTUG
(C) Kent Graziano
4. What Is a Data Warehouse?
“A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making process.”
W.H. Inmon
“The data warehouse is where we publish used data.”
Ralph Kimball
(C) Kent Graziano
5. Inmon’s Definition
• Subject oriented
– Developed around logical data groupings (subject areas)
not business functions
• Integrated
– Common definitions and formats from multiple systems
• Time-variant
– Contains historical view of data
• Non-volatile
– Does not change over time
– No updates
(C) Kent Graziano
6. Data Vault Definition
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.
It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
Dan Linstedt: Defining the Data Vault, TDAN.com article
(C) TeachDataVault.com
7. Why Bother With Something New?
Old Chinese proverb: “Unless you change direction, you're apt to end up where you're headed.”
(C) TeachDataVault.com
8. Why do we need it?
• We have seen issues in constructing (and managing) an enterprise data warehouse model using 3rd normal form or Star Schema.
– 3NF – complex PKs when cascading snapshot dates (time-driven PKs)
– Star – difficult to re-engineer fact tables for granularity changes
• These issues lead to breakdowns in flexibility, adaptability, and even scalability.
(C) Kent Graziano
9. Data Vault Time Line
The slide shows a 1960–2000 timeline; reconstructed chronologically:
• Mid 60’s – Dimension & Fact modeling presented by General Mills and Dartmouth University
• Early 70’s – E.F. Codd invents relational modeling; Chris Date and Hugh Darwen maintain and refine modeling
• Early 70’s – Bill Inmon begins discussing Data Warehousing
• Mid 70’s – AC Nielsen popularizes Dimension & Fact terms
• 1976 – Dr Peter Chen creates E-R diagramming
• Mid 80’s – Bill Inmon popularizes Data Warehousing
• Mid–Late 80’s – Dr Kimball popularizes Star Schema
• Late 80’s – Barry Devlin and Dr Kimball release “Business Data Warehouse”
• 1990 – Dan Linstedt begins R&D on Data Vault Modeling
• 2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling
(C) TeachDataVault.com
10. Data Vault Evolution
• The work on the Data Vault approach began in the early
1990s, and completed around 1999.
• Throughout 1999, 2000, and 2001, the Data Vault design was
tested, refined, and deployed into specific customer sites.
• In 2002, the industry thought leaders were asked to review
the architecture.
– This is when I attended my first DV seminar in Denver and met Dan!
• In 2003, Dan began teaching the modeling techniques to the
mass public.
(C) Kent Graziano
15. Hub and Spoke = Scalability
http://www.nature.com/ng/journal/v29/n2/full/ng1001-105.html
If nature uses Hub & Spoke, why shouldn’t we?
Genetics scale to billions of cells; the Data Vault scales to billions of records.
(C) TeachDataVault.com
16. Hubs = Neurons
Very similar to a neural network, the Hubs create the base structure.
(C) TeachDataVault.com
17. Links = Dendrite + Synapse
In neural networks, dendrites and synapses fire to pass messages; the Links dictate associations and connections.
(C) TeachDataVault.com
18. Satellites = Memories
Perception, understanding, and processing all describe the memory. Satellites house descriptors that can change over time.
(C) TeachDataVault.com
19. National Drug Codes + Orange Book of Drug Patent Applications
A WORKING EXAMPLE
http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm
http://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm
(C) TeachDataVault.com
20. 1. Hub = Business Keys
Product Number
Drug Label Code
NDA Application #
Firm Name
Dose Form Code
Drug Listing
Patent Number
Patent Use Code
Hubs = Unique Lists of Business Keys
Business Keys are used to
TRACK and IDENTIFY key information
(C) TeachDataVault.com
21. Business Keys = Ontology
Business Keys should be arranged in an ontology in order to learn the dependencies of the data set:
• Firm Name
• Drug Listing
• Product Number
• Dose Form Code
• NDA Application #
• Drug Label Code
• Patent Number
• Patent Use Code
NOTE: Different ontologies represent different views of the data!
(C) TeachDataVault.com
22. Hub Entity
A Hub is a list of unique business keys.
Hub Structure (generic) → Hub Product (example):
• Primary Key → Product Sequence ID
• <Business Key> – Unique Index (Primary Index) → Product Number
• Load DTS → Product Load DTS
• Record Source → Prod Record Source
Note:
• A Hub’s Business Key is a unique index.
• A Hub’s Load Date represents the FIRST TIME the EDW saw the data.
• A Hub’s Record Source represents: first, the “Master” data source (on collisions); if not available, it holds the origination source of the actual key.
(C) TeachDataVault.com
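To make the Hub pattern concrete, here is a minimal sketch of the structure above as a table definition, using Python's built-in sqlite3 module for portability; the table and column names (hub_product, product_sqn, etc.) are illustrative choices of mine, not prescribed by the Data Vault standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A Hub holds nothing but: surrogate key, unique business key,
# load date (first time the EDW saw the key), and record source.
conn.execute("""
    CREATE TABLE hub_product (
        product_sqn  INTEGER PRIMARY KEY,   -- surrogate sequence ID
        product_num  TEXT NOT NULL UNIQUE,  -- business key (unique index)
        load_dts     TEXT NOT NULL,         -- first arrival in the warehouse
        rec_src      TEXT NOT NULL          -- originating system of the key
    )
""")
```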
23. Business Keys
• What exactly are Business Keys?
– Example 1:
• Siebel has a “system generated” customer key
• Oracle Financials has a “system generated” customer key
• These are not business keys. These are keys used by each respective system to track records.
– Example 2:
• Siebel tracks customer name and address as unique elements.
• Oracle Financials tracks name and address as unique elements.
• These are business keys.
• What we want in the hub are sets of natural business keys that uniquely identify the data – across systems.
• Stay away from “system generated” keys if possible.
– System-generated keys will cause damage in the integration cycle if they are not unique across the enterprise.
(C) TeachDataVault.com
24. Hub Definition
• What Makes a Hub Key?
– A Hub is based on an identifiable business key.
– An identifiable business key is an attribute that is used in the source systems to locate data.
– The business key has a very low propensity to change, and usually is not editable on the source systems.
– The business key has the same semantic meaning and the same granularity across the company, but not necessarily the same format.
• Attributes and Ordering
– All attributes are mandatory.
– Sequence ID 1st, Business Key 2nd, Load Date 3rd, Record Source last (4th).
– All attributes in the Business Key form a UNIQUE index.
(C) TeachDataVault.com
25. The technical objective of the Hub is to:
• Uniquely list all possible business keys – good, bad, or indifferent – regardless of where they originated.
• Tie the business keys in a 1:1 ratio with surrogate keys (giving meaning to the generated surrogate sequences).
• Provide a consolidation and attribution layer for clear horizontal definition of the business functionality.
• Track the arrival of data: the first time it appears in the warehouse.
• Provide right-time / real-time systems the ability to load transactions without descriptive data.
(C) TeachDataVault.com
26. Hub Table Structures
SQN = Sequence (insertion order)
LDTS = Load Date (when the Warehouse first sees the data)
RSRC = Record Source (System + App where the data ORIGINATED)
(C) TeachDataVault.com
27. Sample Hub Product
ID | PRODUCT # (unique index) | LOAD DTS | RCRD SRC
1 | MFG-PRD123456 | 6-1-2000 | MANUFACT
2 | P1235 | 6-2-2000 | CONTRACTS
3 | *P1235 | 2-15-2001 | CONTRACTS
4 | MFG-1235 | 5-17-2001 | MANUFACT
5 | 1235-MFG | 7-14-2001 | FINANCE
6 | 1235 | 10-13-2001 | FINANCE
7 | PRD128582 | 4-12-2002 | MANUFACT
8 | PRD125826 | 4-12-2002 | MANUFACT
9 | PRD128256 | 4-12-2002 | MANUFACT
10 | PRD929929-* | 4-12-2002 | MANUFACT
Notes:
• ID is the surrogate sequence number (Primary Key)
• What does the load date tell you?
• Do you notice any overloaded uses for the product number?
• Are there similar keys from different systems?
• Can you spot entry errors?
• Are any patterns visually present?
(C) TeachDataVault.com
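The load pattern implied by this sample is equally simple: a hub loader inserts a business key only the first time it appears, stamping the load date and record source. A hedged sketch, continuing the hypothetical hub_product table above:

```python
from datetime import datetime, timezone

def load_hub_product(conn, business_keys, rec_src):
    """Insert each business key only if the hub has never seen it before."""
    now = datetime.now(timezone.utc).isoformat()
    for key in business_keys:
        # INSERT OR IGNORE leans on the unique index over the business key,
        # so re-runs and duplicate feeds never create duplicate hub rows.
        conn.execute(
            "INSERT OR IGNORE INTO hub_product (product_num, load_dts, rec_src)"
            " VALUES (?, ?, ?)",
            (key, now, rec_src),
        )

load_hub_product(conn, ["MFG-PRD123456", "P1235"], "MANUFACT")
```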
28. 2. Links = Associations
• Firms Generate Labels
• Firms Generate Product Listings
• Listings Contain Labeler Codes
• Firms Manufacture Products
• Listings for Products are in NDA Applications
Links = Transactions and Associations. They are used to hook together multiple sets of information (i.e., Hubs).
(C) TeachDataVault.com
29. Associations = Ontological Hooks
• Firms Generate Product Listings (Firm Name – Drug Listing)
• Firms Manufacture Products (Firm Name – Product Number)
• Listings for Products are in NDA Applications (Drug Listing – NDA Application #)
Business Keys are associated by many linking factors; these links comprise the associations in the hierarchy.
(C) TeachDataVault.com
30. Link Definitions
• What Makes a Link?
– A Link is based on identifiable business element relationships.
• Otherwise known as a foreign key;
• AKA a business event or transaction between business keys.
– The relationship shouldn’t change over time.
• It is established as a fact that occurred at a specific point in time and will remain that way forever.
– The link table may also represent a hierarchy.
• Attributes
– All attributes are mandatory.
(C) TeachDataVault.com
31. Link Entity
A Link is an intersection of business keys. It can contain Hub keys and other Link keys.
Link Structure (generic) → Link Line-Item (example):
• Primary Key → Link Line Item Sequence ID
• {Hub Surrogate Keys 1..N} – Unique Index (Primary Index) → Hub Product Sequence ID, Hub Order Sequence ID
• Load DTS → Load DTS
• Record Source → Record Source
Note:
• A Link’s Business Key is a composite unique index.
• A Link’s Load Date represents the FIRST TIME the EDW saw the relationship.
• A Link’s Record Source represents: first, the “Master” data source (on collisions); if not available, it holds the origination source of the actual key.
(C) TeachDataVault.com
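As a sketch of the same structure in table form, continuing the SQLite example (again, names are hypothetical; a hub_order companion hub is created here only so the link's foreign keys resolve):

```python
# Companion hub so the link's second foreign key has a target.
conn.execute("""
    CREATE TABLE hub_order (
        order_sqn  INTEGER PRIMARY KEY,
        order_num  TEXT NOT NULL UNIQUE,
        load_dts   TEXT NOT NULL,
        rec_src    TEXT NOT NULL
    )
""")

# The link: its own surrogate key, one FK per hub, load date, record source,
# and a composite unique index over the hub keys (the link's "business key").
conn.execute("""
    CREATE TABLE lnk_line_item (
        line_item_sqn  INTEGER PRIMARY KEY,
        product_sqn    INTEGER NOT NULL REFERENCES hub_product (product_sqn),
        order_sqn      INTEGER NOT NULL REFERENCES hub_order (order_sqn),
        load_dts       TEXT NOT NULL,
        rec_src        TEXT NOT NULL,
        UNIQUE (product_sqn, order_sqn)
    )
""")
```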
32. Modeling Links - 1:1 or 1:M?
• Today:
– The relationship is 1:1, so why model a Link?
• Tomorrow:
– The business rule can change to a 1:M.
– You discover new data later.
• With a Link in the Data Vault:
– No need to change the EDW structure.
– Existing data is fine.
– New data is added.
(C) Kent Graziano
33. Link Table Structures
SQN = Sequence (insertion order)
LDTS = Load Date (when the Warehouse first sees the data)
RSRC = Record Source (System + App where the data ORIGINATED)
(C) TeachDataVault.com
35. Sample Link Entity - Hierarchy
Hub Customer:
ID | CUSTOMER # | LOAD DTS | RCRD SRC
1 | ABC123456 | 10-12-2000 | MANUFACT
2 | ABC925_24FN | 10-22-2000 | CONTRACTS
3 | DKEF | 1-25-2001 | CONTRACTS
4 | KKO92854_dd | 3-7-2001 | CONTRACTS
5 | LLOA_82J5J | 6-4-2001 | SALES
6 | HUJI_BFIOQ | 8-3-2001 | SALES
7 | PPRU_3259 | 2-2-2002 | FINANCE
8 | PAFJG2895 | 2-2-2002 | CONTRACTS
9 | 929ABC2985 | 2-2-2002 | CONTRACTS
10 | 93KFLLA | 2-2-2002 | CONTRACTS
Link Customer Rollup:
From CSID | To CSID | LOAD DTS | RCRD SRC
1 | NULL | 10-14-2000 | FINANCE
2 | 1 | 10-22-2000 | FINANCE
3 | 1 | 2-15-2001 | FINANCE
4 | 2 | 4-3-2001 | HR
5 | 2 | 6-4-2001 | SALES
Note:
• If you have logic, you can roll together customers, or companies, or sub-assemblies, bills of materials, etc.
• We do not want to disturb the facts (underlying data in the hub), but we do want to re-arrange hierarchies at different points over time.
(C) Kent Graziano
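A hierarchy link like Customer Rollup is simply a link that references the same hub twice. A hedged sketch, continuing the SQLite example (the hub_customer table is introduced here purely for illustration):

```python
conn.execute("""
    CREATE TABLE hub_customer (
        cust_sqn  INTEGER PRIMARY KEY,
        cust_num  TEXT NOT NULL UNIQUE,
        load_dts  TEXT NOT NULL,
        rec_src   TEXT NOT NULL
    )
""")

# Self-referencing link: "from" and "to" both point at hub_customer, so the
# hierarchy can be re-stated over time without touching the hub rows at all.
conn.execute("""
    CREATE TABLE lnk_customer_rollup (
        rollup_sqn     INTEGER PRIMARY KEY,
        from_cust_sqn  INTEGER NOT NULL REFERENCES hub_customer (cust_sqn),
        to_cust_sqn    INTEGER REFERENCES hub_customer (cust_sqn),  -- NULL = top of tree
        load_dts       TEXT NOT NULL,
        rec_src        TEXT NOT NULL
    )
""")
```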
36. Link To Link (Link Sale Component)
[Diagram: Link Sale Line Item joins Hub Invoice (Sat Totals, Sat Dates), Hub Product (Sat Product Desc., plus a Product Hierarchy link), and Hub Customer (Sat Cust Active, Sat Address); Link Sale Component (Sat Quantity, Sat Sub-Totals) hangs off Link Sale Line Item.]
Note:
• Link Sale Component provides a shift in grain.
• Link Sale Component allows for configurable options of products tracked on a single line-item product sold.
• Link Sale Component provides for sub-assembly tracking.
(C) Kent Graziano
37. 3. Satellites = Descriptors
• Firm Locations
• Patent Expiration Info
• Listing Formulation
• Listing Medication Dosages
• Product Ingredients
• Drug Packaging Types
Satellites = Descriptors. These data provide context for the keys (Hubs) and for the associations (Links).
(C) TeachDataVault.com
38. Satellite Definitions
• What Makes a Satellite?
– A Satellite is based on non-identifying business elements.
• Attributes that are descriptive data, often known in the source systems as descriptions, free-form entry, or computed elements.
– The Satellite data changes, sometimes rapidly, sometimes slowly.
• The Satellites are separated by type of information and rate of change.
– The Satellite is dependent on the Hub or Link key as a parent.
• Satellites are never dependent on more than one parent table.
• The Satellite is never a parent table to any other table (no snowflaking).
• Attributes and Ordering
– All attributes are mandatory – EXCEPT END DATE.
– Parent ID 1st, Load Date 2nd, Load End Date 3rd, Record Source last.
(C) TeachDataVault.com
39. Descriptors = Context
• Firm Name – described by the Firm Locations satellite
• Drug Listing (Firms Generate Product Listings) – described by the Listing Formulation satellite
• Product Number (Firms Manufacture Products) – described by the Product Ingredients satellite
Start & end of manufacturing: the context-specific, point-in-time warehousing portion.
(C) TeachDataVault.com
40. Satellite Entity
A Satellite is a time-dimensional table housing detailed information about the Hub’s or Link’s business keys.
Satellite Structure (generic) → Satellite Customer (example):
• Hub Primary Key → Customer #
• Load DTS → Load DTS
• Extract DTS → Extract DTS
• Load End Date → Load End Date
• Detail Business Data <Aggregation Data> → Customer Name, Customer Addr1, Customer Addr2
• {Update User} → {Update User}
• {Update DTS} → {Update DTS}
• Record Source → Record Source
• Satellites are defined by TYPE of data and RATE OF CHANGE.
• Mathematically, this reduces redundancy and decreases storage requirements over time (compared to a Star Schema).
(C) TeachDataVault.com
41. Satellite Entity – Details
• A Satellite has only 1 foreign key; it is dependent on the parent table (Hub or Link).
• A Satellite may or may not have an “Item Numbering” attribute.
• A Satellite’s Load Date represents the date the EDW saw the data (must be a delta set).
– This is not Effective Date from the source!
• A Satellite’s Record Source represents the actual source of the row (unit of work).
• To avoid outer joins, you must ensure that every satellite has at least 1 entry for every Hub key.
(C) TeachDataVault.com
42. Satellite Table Structures
SQN = Sequence (parent identity number)
LDTS = Load Date (when the Warehouse first sees the data)
LEDTS = End of lifecycle for superseded record
RSRC = Record Source (System + App where the data ORIGINATED)
(C) TeachDataVault.com
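Putting the satellite rules together, here is a hedged sketch of a satellite keyed by parent sequence plus load date, with a delta load that only writes a row when the descriptive value actually changed and end-dates the superseded version. It continues the hypothetical hub_customer table above; all names are illustrative.

```python
conn.execute("""
    CREATE TABLE sat_customer_name (
        cust_sqn   INTEGER NOT NULL REFERENCES hub_customer (cust_sqn),
        load_dts   TEXT NOT NULL,        -- when the EDW saw this version
        load_edts  TEXT,                 -- NULL while current (the optional END DATE)
        cust_name  TEXT,
        rec_src    TEXT NOT NULL,
        PRIMARY KEY (cust_sqn, load_dts) -- parent key 1st, load date 2nd
    )
""")

def load_sat_customer_name(conn, cust_sqn, name, rec_src, now):
    """Delta load: insert only when the descriptive data actually changed."""
    current = conn.execute(
        "SELECT cust_name FROM sat_customer_name"
        " WHERE cust_sqn = ? AND load_edts IS NULL",
        (cust_sqn,),
    ).fetchone()
    if current is not None and current[0] == name:
        return  # no change, no new row (keeps the satellite a true delta set)
    # End-date the superseded version, then insert the new current row.
    conn.execute(
        "UPDATE sat_customer_name SET load_edts = ?"
        " WHERE cust_sqn = ? AND load_edts IS NULL",
        (now, cust_sqn),
    )
    conn.execute(
        "INSERT INTO sat_customer_name VALUES (?, ?, NULL, ?, ?)",
        (cust_sqn, now, name, rec_src),
    )
```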
43. Satellite Entity – Hub Related
Hub Customer:
ID | CUSTOMER # | LOAD DTS | RCRD SRC
0 | N/A | 10-12-2000 | SYSTEM
1 | ABC123456 | 10-12-2000 | MANUFACT
2 | ABC925_24FN | 10-2-2000 | CONTRACTS
3 | ABC5525-25 | 10-1-2000 | FINANCE
Customer Name Satellite:
CSID | LOAD DTS | NAME | RCRD SRC
0 | 10-12-2000 | N/A | SYSTEM
1 | 10-12-2000 | ABC Suppliers | MANUFACT
1 | 10-14-2000 | ABC Suppliers, Inc | MANUFACT
1 | 10-31-2000 | ABC Worldwide Suppliers, Inc | MANUFACT
1 | 12-2-2000 | ABC DEF Incorporated | CONTRACTS
2 | 10-2-2000 | WorldPart | CONTRACTS
2 | 10-14-2000 | Worldwide Suppliers Inc | CONTRACTS
3 | 10-1-2000 | N/A | FINANCE
Note: dummy satellite records (the N/A rows) eliminate the need for outer joins during extract.
(C) Kent Graziano
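The payoff of the dummy record is that extraction never needs an outer join: every hub key is guaranteed at least one satellite row. A sketch against the hypothetical tables built up above:

```python
# Because a dummy ("N/A") satellite row is seeded for every hub key,
# a plain inner join still returns all hub keys; no outer join required.
rows = conn.execute("""
    SELECT h.cust_num, s.cust_name, s.load_dts
    FROM   hub_customer h
    JOIN   sat_customer_name s ON s.cust_sqn = h.cust_sqn
    WHERE  s.load_edts IS NULL   -- current version of each descriptor only
""").fetchall()
```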
44. Satellite Entity – Link Related
Link Order Details:
ID | Product ID | OrdID | LOAD DTS | RCRD SRC
0 | 0 | 0 | 10-12-2000 | SYSTEM
1 | PRD102 | 1 | 10-12-2000 | MANUFACT
2 | PRD103 | 1 | 10-2-2000 | CONTRACTS
Satellite Order Totals:
ID | LOAD DTS | Tax | Total | RCRD SRC
0 | 10-12-2000 | <NULL> | <NULL> | SYSTEM
1 | 10-12-2000 | 3.00 | 0.00 | MANUFACT
1 | 10-14-2000 | 4.00 | 12.00 | MANUFACT
1 | 10-31-2000 | 3.69 | 14.02 | MANUFACT
1 | 12-2-2000 | 4.69 | 13.69 | CONTRACTS
2 | 10-2-2000 | 2.45 | 10.00 | CONTRACTS
2 | 10-14-2000 | 1.22 | 14.00 | CONTRACTS
Note: the dummy satellite record eliminates the need for outer joins during extract.
(C) Kent Graziano
45. Satellite Splits – Type of Information
Hub Customer:
ID | CUSTOMER # | LOAD DTS | RCRD SRC
0 | N/A | 10-12-2000 | SYSTEM
1 | ABC123456 | 10-12-2000 | MANUFACT
2 | ABC925_24FN | 10-2-2000 | CONTRACTS
3 | ABC5525-25 | 10-1-2000 | FINANCE
Customer Satellite:
CSID | LOAD DTS | NAME | Contact | Sales Rgn | Cust Score | RCRD SRC
0 | 10-12-2000 | N/A | N/A | N/A | 0 | SYSTEM
1 | 10-12-2000 | ABC Suppliers | Jen F. | SE | 102 | MANUFACT
1 | 10-14-2000 | ABC Suppliers, Inc | Jen F. | SE | 120 | MANUFACT
1 | 10-31-2000 | ABC Worldwide Suppliers, Inc | Jen F. | SE | 130 | MANUFACT
1 | 12-2-2000 | ABC DEF Incorporated | Jack J. | SC | 85 | CONTRACTS
2 | 10-2-2000 | WorldPart | Jenny | SE | 99 | CONTRACTS
2 | 10-14-2000 | Worldwide Suppliers Inc | Jenny | SE | 102 | CONTRACTS
3 | 10-1-2000 | N/A | N/A | N/A | 0 | FINANCE
(C) Kent Graziano
46. Satellite Splits – Type of Information
Hub Customer:
ID | CUSTOMER # | LOAD DTS | RCRD SRC
0 | N/A | 10-12-2000 | SYSTEM
1 | ABC123456 | 10-12-2000 | MANUFACT
2 | ABC925_24FN | 10-2-2000 | CONTRACTS
3 | ABC5525-25 | 10-1-2000 | FINANCE
The Hub now carries two satellites: a Customer Name Satellite (name info) and a Customer Sales Satellite (sales info).
• Because the type of information is different, we split the logical groups into multiple Satellites.
• This provides sheer flexibility in the representation of the information.
• We may have one more problem with rate of change…
(C) Kent Graziano
47. Satellite Splits – Rate of Change
Hub Customer:
ID | CUSTOMER # | LOAD DTS | RCRD SRC
0 | N/A | 10-12-2000 | SYSTEM
1 | ABC123456 | 10-12-2000 | MANUFACT
2 | ABC925_24FN | 10-2-2000 | CONTRACTS
3 | ABC5525-25 | 10-1-2000 | FINANCE
Customer Satellite (combined, as on the previous slide):
CSID | LOAD DTS | NAME | Contact | Sales Rgn | Cust Score | RCRD SRC
0 | 10-12-2000 | N/A | N/A | N/A | 0 | SYSTEM
1 | 10-12-2000 | ABC Suppliers | Jen F. | SE | 102 | MANUFACT
1 | 10-14-2000 | ABC Suppliers, Inc | Jen F. | SE | 120 | MANUFACT
1 | 10-31-2000 | ABC Worldwide Suppliers, Inc | Jen F. | SE | 130 | MANUFACT
1 | 12-2-2000 | ABC DEF Incorporated | Jack J. | SC | 85 | CONTRACTS
2 | 10-2-2000 | WorldPart | Jenny | SE | 99 | CONTRACTS
2 | 10-14-2000 | Worldwide Suppliers Inc | Jenny | SE | 102 | CONTRACTS
3 | 10-1-2000 | N/A | N/A | N/A | 0 | FINANCE
(C) Kent Graziano
48. Satellite Splits – Rate of Change
Hub Customer:
ID | CUSTOMER # | LOAD DTS | RCRD SRC
0 | N/A | 10-12-2000 | SYSTEM
1 | ABC123456 | 10-12-2000 | MANUFACT
2 | ABC925_24FN | 10-2-2000 | CONTRACTS
3 | ABC5525-25 | 10-1-2000 | FINANCE
The Hub now carries three satellites: Customer Name Satellite (name info), Customer Sales Satellite (sales info), and Customer Scoring Satellite.
• Assume the data to score customers begins arriving in the warehouse every 5 minutes… We then separate the scoring information from the rest of the satellites.
• IF we end up with data that (over time) doesn’t change as much as we thought, we can always re-combine Satellites to eliminate joins.
(C) Kent Graziano
49. Satellites Split By Source System
SAT_SALES_CUST: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, Name, Phone Number, Best Time of Day to Reach, Do Not Call Flag
SAT_FINANCE_CUST: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, First Name, Last Name, Guardian Full Name, Co-Signer Full Name, Phone Number, Address, City, State/Province, Zip Code
SAT_CONTRACTS_CUST: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, Contact Name, Contact Email, Contact Phone Number
Satellite Structure (generic):
• PARENT SEQUENCE + LOAD DATE = Primary Key
• <LOAD-END-DATE>
• <RECORD-SOURCE>
• {user defined descriptive data} {or temporal based timelines}
(C) TeachDataVault.com
51. World’s Smallest Data Vault
Hub Customer: Hub_Cust_Seq_ID, Hub_Cust_Num, Hub_Cust_Load_DTS, Hub_Cust_Rec_Src
Satellite Customer Name: Hub_Cust_Seq_ID, Sat_Cust_Load_DTS, Sat_Cust_Load_End_DTS, Sat_Cust_Name, Sat_Cust_Rec_Src
• The Data Vault doesn’t have to be “BIG”.
• A Data Vault can be built incrementally.
• Reverse engineering one component of the existing models is not uncommon.
• Building one part of the Data Vault, then changing the marts to feed from that vault, is a best practice.
• The smallest Enterprise Data Warehouse consists of two tables: one Hub and one Satellite.
(C) TeachDataVault.com
52. Top 10 Rules for DV Modeling
Business keys with a low propensity for change become Hub keys. Transactions and integrated keys become Link tables. Descriptive data always fits in a Satellite.
1. A Hub table always migrates its primary key outwards.
2. Hub to Hub relationships are allowed only through a link structure.
3. Recursive relationships are resolved through a link table.
4. A Link structure must have at least 2 FK relationships.
5. A Link structure can have a surrogate key representation.
6. A Link structure has no limit to the number of hubs it integrates.
7. A Link to Link relationship is allowed.
8. A Satellite can be dependent on a link table.
9. A Satellite can only have one parent table.
10. A Satellite cannot have any foreign key relationships except the primary key to the parent table (hub or link).
(C) TeachDataVault.com
53. NOTE: Automating the Build
• DV is a repeatable methodology with rules and standards
• Standard templates exist for:
– Loading DV tables
– Extracting data from DV tables
• RapidAce (www.rapidace.com – now Open Source)
– Software that applies these rules to:
• Convert 3NF models to DV
• Convert DV to Star Schema
• This could save us lots of time and $$
(C) Kent Graziano
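Because every hub (and likewise every link and satellite) follows the same fixed pattern, the DDL can be generated from metadata rather than hand-written. A toy sketch of that idea, continuing the SQLite example (the generator and all names are mine, not RapidAce's):

```python
def hub_ddl(name: str, business_key: str) -> str:
    """Emit the standard hub pattern for any entity/business-key pair."""
    return f"""
    CREATE TABLE hub_{name} (
        {name}_sqn     INTEGER PRIMARY KEY,
        {business_key} TEXT NOT NULL UNIQUE,
        load_dts       TEXT NOT NULL,
        rec_src        TEXT NOT NULL
    )"""

# One metadata row per hub is enough to build the whole layer.
for entity, key in [("firm", "firm_name"), ("drug_listing", "drug_label_code")]:
    conn.execute(hub_ddl(entity, key))
```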
54. In Review…
• Data Vault is…
– A Data Warehouse Modeling Technique (&
Methodology)
– Hub and Spoke Design
– Simple, Easy, Repeatable Structures
– Comprised of Standards, Rules & Procedures
– Made up of Ontological Metadata
– AUTOMATABLE!!!
• Hubs = Business Keys
• Links = Associations / Transactions
• Satellites = Descriptors
(C) TeachDataVault.com
55. The Experts Say…
“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” – Bill Inmon
“The Data Vault is foundationally strong and exceptionally scalable architecture.” – Stephen Brobst
“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” – Doug Laney
56. More Notables…
“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” – Howard Dresner
“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from.” – Scott Ambler
58. Growing Adoption…
• The number of Data Vault users in the US surpassed 500 in 2010 and is growing rapidly (http://danlinstedt.com/about/dv-customers/)
(C) Kent Graziano
59. Conclusion?
Changing the direction of the river takes less effort than stopping the flow of water.
(C) TeachDataVault.com
61. Where To Learn More
The Technical Modeling Book: http://LearnDataVault.com
On YouTube: http://www.youtube.com/LearnDataVault
On Facebook: www.facebook.com/learndatavault
Dan’s Blog: www.danlinstedt.com
The Discussion Forums: http://LinkedIn.com – Data Vault Discussions
Worldwide User Group (free): http://dvusergroup.com
The Business of Data Vault Modeling by Dan Linstedt, Kent Graziano, and Hans Hultgren (available at www.lulu.com)