Data Vault: The Next Evolution in Data Modeling, Part 1
By Dan Linstedt
Editor’s note: Dan Linstedt has a presentation that further develops the concept of the Data Vault
as the Next Evolution in Data Modeling in the "Building a DW Infrastructure to Support BI
Initiatives" online trade show now available at www.dataWarehouse.com/tradeshow/.
The purpose of this article is to present and discuss a patent-pending technique called the Data Vault – the
next evolution in data modeling for enterprise data warehousing. This is a highly technical paper and is
meant for an audience of data modelers, data architects and database administrators. It is not meant for
business analysts, project managers or mainframe programmers. It is recommended that there is a base level
of knowledge in common data modeling terms such as table, relationship, parent, child, key
(primary/foreign), dimension and fact.
For too long we have waited for data structures to finally catch up with artificial intelligence and data mining
applications. Most of the data mining technology has to import flat file information in order to join the form
with the function. Unfortunately, volumes in data warehouses are growing rapidly and exporting this
information for data mining purposes is becoming increasingly difficult. It simply doesn’t make sense to
have this discontinuity between form (structure), function (artificial intelligence) and execution (the act of
mining the data).
Marrying form, function and execution holds tremendous power for the artificial intelligence (AI) and data
mining communities. Having data structures that are mathematically sound increases the ability to bring
these technologies back into the database. The Data Vault is based on mathematical principles that allow it to
be extensible and capable of handling massive volumes of information. The architecture and structure are
designed to handle dynamic changes to relationships between information.
A stretch of the imagination might be to one day encapsulate the data with the functions of data mining,
hopefully to move towards a "self-aware" independent piece of information – but that’s just a dream for now.
It is possible to form, drop and evaluate relationships between data sets dynamically. Thus changing the
landscape of what is possible with a data model; essentially bringing the data model into a dynamic state of
flux (through the use of data mining/artificial intelligence).
By implementing reference architectures on top of a Data Vault structure – the functions that access the
content may begin to execute in parallel and in an automated dynamic fashion. The Data Vault solves some
of the enterprise data warehousing structural and storage problems from a normalized, best of breed
perspective. The concepts provide a whole host of opportunities in applying this unique technology.
Defining a Data Vault
Definition: The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized
tables that support one or more functional areas of business. It is a hybrid approach encompassing the best
of breed between third normal form (3NF) and star schema. The design is flexible, scalable, consistent and
adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs
of enterprise data warehouses.
The Data Vault is architected to meet the needs of the data warehouse, not to be confused with a data mart. It
can double as an operational data store (ODS) if the correct hardware and database engine are in place to
support it. The Data Vault can handle massive sets of granular data in a smaller, more normalized physical
space in comparison to both 3NF and star schema. The Data Vault is foundationally strong. It is based on the
mathematical principles that support the normalized data models. Inside the Data Vault model are familiar
structures that match traditional definitions of star schema and 3NF that include dimensions, many to many
linkages and standard table structures. The differences lie in relationship representations, field structuring
and granular time-based data storage. The modeling techniques built into the Data Vault have undergone
years of design and testing across many different scenarios providing them with a solid foundational
approach to data warehousing.
A Brief History of Data Modeling for Data Warehousing
3NF was originally developed in the early 1970s (Codd & Date) for online transaction processing (OLTP)
systems. In the early 1980s, it was adapted to meet the growing needs of data warehouses. Essentially, a
date/time stamp was added to the primary keys in each of the table structures. (See Figure 1.)
In the mid to late 1980s, star schema data modeling was introduced and perfected. It was architected to solve
subject-oriented problems including (but not limited to) aggregations, data model structural change, query
performance, reusable or shared information, ease of use and the ability to support online analytical
processing (OLAP). This single subject-centric architecture became known as a data mart. Soon thereafter it
too was adapted to multi-subject data warehousing as an attempt to meet the growing needs of enterprise
data warehousing. The term for this is conformed data marts.
Figure 1: Data Model Time Line
Performance and other weaknesses of 3NF and star schema (when used within an enterprise data warehouse)
began to show in the 1990s as the volume of data increased. The Data Vault is architected to overcome these
shortcomings while retaining the strengths of 3NF and star schema architectures. Within the past year, this
technique has been favorably received by industry experts. The Data Vault is the next evolution in data
modeling because it’s architected specifically for enterprise data warehouses.
The Problems of Existing Data Warehouse Data Modeling Architectures
Each modeling technique has limitations when it is applied to enterprise data warehouse architecture.
This is because they are an adaptation of a design rather than a design built specifically for the task. These
limitations reduce usability and are constantly contributing to the "holy wars" in the data warehousing world.
The following paragraphs are with respect to these architectures being applied as data warehouses, not as
their respective original purposes.
3NF has the following issues to contend with: time-driven primary key issues causing parent-child
complexities, cascading change impacts, difficulties in near real-time loading, troublesome query access,
problematic drill-down analysis, top-down architecture and unavoidable top-down implementation. Figure 2
is an original 3NF model adapted to data warehousing architecture. One particularly thorny problem is
evident when a date/time stamp is placed into the primary key of a parent table. This is necessary in order to
represent changes to detail data over time.
The problem is scalability and flexibility. If an additional parent table is added, the change is forced to
cascade down through all subordinate table structures. Also, when a new row is inserted with an existing
parent key (the only field to change is the date/time stamp), all child rows must be reassigned to the new
parent key. This cascading effect has a tremendous impact on the processes and the data model – the larger
the model the greater the impact. This makes it difficult (if not impossible) to extend and maintain an
enterprise-wide data model. The architecture and design suffer as a result.
Figure 2: Date/Time Stamped 3NF
The conformed data mart also has trouble. It is a collection of fact tables that are linked together via primary/
foreign keys – in other words, a linked set of related star schemas. The problems this creates are numerous:
isolated subject oriented information, possible data redundancy, inconsistent query structuring, agitated
scalability issues, difficulties with fact table linkages (incompatible grain), synchronization issues in near
real-time loading, limited enterprise views and troublesome data mining. While the star schema is typically
bottom-up in both architecture and implementation, the conformed data mart should be top-down
architecture with bottom-up implementation. However, informal polling has shown that bottom-up
architecture and bottom-up implementation appear to be the standard.
One of the most difficult issues of a conformed data mart (or conformed fact tables) is getting the grain right.
That means understanding the data as it is aggregated for each fact table and assuring that the aggregation
will stay consistent for all time (during the life of the relationship) and the structure of each fact table will not
change (i.e., no new dimensions will be added to either fact table). This limits design, scalability and
flexibility of the data model. Another issue is the "helper table." This table is defined to be a dimension-to-
dimension relationship link. Granularity is very important, as is the stability of the design of the dimension.
This too limits design, scalability and flexibility of the data model.
Figure 3: Conformed Data Mart
If the granularity of the Revenue Fact is altered, then it is no longer the same (duplicate) fact table. By
adding a dimension to one of the fact tables, the granularity frequently changes. It has also been suggested
that fact tables can be linked together just because they carry the same dimension keys. This is only true if
the facts are aggregated to the same granularity, which is an extremely difficult task to maintain as the
system grows and matures.
The Importance of Architecture and Design for Enterprise Data Warehousing
A data warehouse should be top-down architecture and bottom-up implementation. This allows the
architecture to reach the maximum known knowledge boundaries, while the implementation can be
scope-controlled to facilitate fast delivery times. The implementation should, therefore, be designed as a
plug-and-play set of tables without becoming a stovepipe upon delivery. The design and architecture of a
data warehouse must be flexible enough to grow and change with the business needs, because the needs of
today are not necessarily the needs of tomorrow.
Our industry has a need for a formalized data modeling architecture and design that is capable of accurately
representing data warehouses. The architecture must be a defined normalization for data warehousing versus
a defined normalization for OLTP systems. For example, the defined normalization of OLTP is first, second
and third normal form; of course, this extends through fourth, fifth and perhaps sixth normal form. Data
warehousing today does not have such a
structured or predefined normalization for data modeling. It is also apparent that it is no longer sufficient to
have a haphazard normalization effort for an enterprise data warehousing architecture. Inconsistencies in
modeling techniques lead to maintenance intensive implementations.
The Data Vault is a defined normalization of data modeling for data warehouses. Its strength lies in the
structure and usage from which the model is built. It utilizes some of the following data modeling
techniques: many-to-many relationships, referential integrity, minimally redundant data sets and business
function keyed information hubs. These techniques make the Data Vault data model flexible, expandable and
consistent. The approach to building a Data Vault data model is iterative, which provides a platform for data
architects and business users to construct enterprise data warehouses in a component-based fashion (see Bill
Inmon’s article "Data Mart Does Not Equal Data Warehouse," DM Review, May 1998).
The Data Vault Components
In order to keep the design simple, yet elegant, there are a minimum number of components, specifically the
hub, link and satellite entities. The Data Vault design is focused around the functional areas of business with
the hub representing the primary key. The link entities provide transaction integration between the hubs. The
satellite entities provide the context of the hub primary key. Each entity is designed to provide maximum
flexibility and scalability while retaining most of the traditional skill sets of data modeling expertise.
A hub entity, or hub, is a single table carrying, at a minimum, a unique list of business keys. These are the
keys that the business utilizes in everyday operations: for example, invoice number, employee number,
customer number, part number and vehicle identification number (VIN). If the business were to lose the key,
they would lose the reference to the context, or surrounding information. Other attributes in the hub include:
• Surrogate Key – Optional component, possibly a smart key or a sequential number.
• Load Date/Time Stamp – Recording when the key itself first arrived in the warehouse.
• Record Source – A recording of the source system utilized for data traceability.
Figure 4 represents what a customer hub might look like. It is a visual representation of the structure within
the database system. In this instance, the customer number is the primary business key, while the ID is a
customer surrogate – assigned for join reasons and reduction of storage.
For example, the requirement is to capture customer number across the company. Accounting may have a
customer number (12345) represented in a numeric style and contracts may have the same customer number
prefixed with an alpha (AC12345). In this case, the representation of the customer number in the hub would
be alphanumeric and set to the maximum length to hold all of the customer numbers from both functional
areas of business. The hub would have two entries: 12345 and AC12345, each would have their own record
source – one from accounting and one from contracts. The obvious preference is to perform cleansing and
matching on these numbers to integrate them together. However that topic is out of scope for this paper. The
hub’s primary key always migrates outward from the hub. Once the business is correctly identified through
its keys (say, customer and account), the link entities can be constructed.
Figure 4: Example Customer Hub
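The hub structure described above can be sketched in code. The following is a minimal illustration in Python with SQLite (chosen here only for portability; the article's own DDL targets SQLServer 2000), and the table and column names (hub_customer, customer_number, load_dts, record_source) are illustrative assumptions rather than the article's actual DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_id     INTEGER PRIMARY KEY,   -- optional surrogate key
        customer_number TEXT NOT NULL UNIQUE,  -- the business key
        load_dts        TEXT NOT NULL,         -- when the key first arrived
        record_source   TEXT NOT NULL          -- source system, for traceability
    )
""")

# The two entries from the example: the same customer as seen by
# accounting (numeric) and by contracts (alpha-prefixed).
rows = [
    ("12345",   "2002-01-15 08:00:00", "ACCOUNTING"),
    ("AC12345", "2002-01-15 08:00:00", "CONTRACTS"),
]
conn.executemany(
    "INSERT INTO hub_customer (customer_number, load_dts, record_source)"
    " VALUES (?, ?, ?)",
    rows,
)

for number, source in conn.execute(
        "SELECT customer_number, record_source FROM hub_customer"
        " ORDER BY customer_number"):
    print(number, source)
```

Note that the UNIQUE constraint on the business key is what makes the hub a "unique list of business keys"; the record source preserves where each variant of the key came from.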
Link entities, or links, are a physical representation of a many-to-many 3NF relationship. The link represents
the relationship or transaction between two or more business components (two or more business keys). It is
instantiated (physically) in the logical model in order to add attributes and surround the transaction with
context (this is discussed in the satellite entity description next). The link contains the following attributes:
• Surrogate Key – Optional component, possibly a smart key or a sequential number. Only utilized if
there are more than two hubs through this link, or if the composite primary key might cause
performance problems.
• Hub 1 Key to Hub N Key – Hub keys migrated into the link to represent the composite key or
relationship between two hubs.
• Load Date/Time Stamp – Recording when the relationship/transaction was first created in the
warehouse.
• Record Source – A recording of the source system utilized for data traceability.
Figure 5: Example Link Table Structure
This is an adaptation of a many-to-many relationship in 3NF in order to solve the problems related to
scalability and flexibility. This modeling technique is designed for data warehouses, not for OLTP systems.
The application loading the warehouse must undertake the responsibility of enforcing one-to-many
relationships if that is the desired result. Please note that some of the foundational rules for data modeling
with the Data Vault will be listed at the end of this document. With just a series of hubs and links, the data
model will begin to describe the business flow. The next component is to understand the context around
when, why, what, where and who constructed both the transaction and the keys themselves. For example, it
is not enough to know what a VIN number is for a vehicle or that there is a driver number five out there
somewhere. The customer is looking to know what the VIN represents (i.e., a blue Toyota pickup, 4WD,
etc.) and that driver number five represents the name Jane and then they may want to know that Jane is the
driver of this particular VIN.
Satellite entities, or satellites, are hub key context (descriptive) information. All of its information is subject
to change over time; therefore, the structure must be capable of storing new or altered data at the granular
level. The VIN number should not change, but if a wrecking crew rebuilds the Toyota – chops the top and
adds a roll bar, it may not be a pickup anymore. What if Jane sells the car to someone else, say driver number
six? The satellite is comprised of the following attributes:
• Satellite Primary Key: Hub or Link Primary Key – Migrated into the satellite from the hub or link.
• Satellite Primary Key: Load Date/Time Stamp – Recording when the context information became
available in the warehouse (a new row is always inserted).
• Satellite Optional Primary Key: Sequence Surrogate Number – Utilized for satellites that have
multiple values (such as a billing and home address) or line item numbers used to keep the satellites
subgrouped and in order.
• Record Source – A recording of the source system utilized for data traceability.
Figure 6: Customer Name Satellite
In Figure 6, we are able to show the changes to customer name over time. The figure also depicts the
different source systems from which the rows originated. This allows the warehouse to store the information
at the most granular level, while maintaining an audit trail in the warehouse for traceability reasons. Notice
that the load_dts is part of the composite primary key; because of this, the data is ordered on a time basis
and is referenced through the customer ID surrogate key.
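The customer name satellite of Figure 6 might be sketched as below; the composite primary key (hub key plus load date) is what makes the delta history possible. Names are again illustrative assumptions in Python/SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_customer_name (
        customer_id   INTEGER NOT NULL,      -- hub primary key, migrated in
        load_dts      TEXT NOT NULL,         -- load date/time stamp
        record_source TEXT NOT NULL,
        name          TEXT,
        PRIMARY KEY (customer_id, load_dts)  -- hub key + load date = delta history
    )
""")
# Two changes to the same customer's name over time, from two sources.
history = [
    (1, "2002-01-15 08:00:00", "ACCOUNTING", "Jane Smith"),
    (1, "2002-06-01 08:00:00", "CONTRACTS",  "Jane Doe"),
]
conn.executemany("INSERT INTO sat_customer_name VALUES (?, ?, ?, ?)", history)

# Current context for the hub key: the row with the latest load date.
current = conn.execute("""
    SELECT name FROM sat_customer_name
    WHERE customer_id = 1
    ORDER BY load_dts DESC LIMIT 1
""").fetchone()[0]
print(current)   # -> Jane Doe
```

Every change is a new insert, never an update, so the full audit trail survives and the most recent row is always recoverable by ordering on the load date.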
The satellite is most closely related to a Type 2 dimension as defined by Ralph Kimball. It stores deltas at a
granular level; its function is to provide context around the hub key. For example, the fact that VIN 1234567
represents a blue Toyota truck today and a red Toyota truck tomorrow. Color may be a satellite for
automobile. Its design relies on the mathematical principles surrounding reduction of data redundancy and
rate of change. For instance, if the automobile is a rental, the dates of availability/rented might change daily
which is much faster than the rate of change for color, tires or owner. The issue that the satellite solves is
defined as follows:
An automobile dimension may contain 160+ attributes; if the color or tires change then all 160+ attributes
must be replicated into a new row (if utilizing a Type 2 dimension). Why replicate data when the rest of the
attributes are changing at slower rates of change? If utilizing a Type 1 or Type 3 dimension it is possible to
lose partial or complete historical trails. In this case, the data modeler should construct at a minimum two
satellites: dates of availability and maintenance/parts. If the customer who rents the auto the first day is Dan
and the second day is Jane, then it is the link’s responsibility to represent the relationship. The data modeler
might attach one or more satellites on the link representing dates rented (from/to), condition of vehicle and
comments made by the renter.
Figure 7: Sample Point-In-Time Table
The point-in-time table is a satellite derivative. It is built to assist queries in their effort to find information at
specific points in time. It is the system of record for the historical pictures that are gathered by the different
satellites. It is typically only built if there are two or more satellites surrounding a hub. With one satellite, a
correlated sub-query will work just fine. This table can also shed light on rates of change comparisons and
from-to date stamping across the satellites. In other words, some of the statistics which are produced by this
table can provide insight into how often different information is changing.
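The point-in-time idea can be sketched as a small lookup: for a given hub key and snapshot time, find the load date in effect in each satellite. The satellite and column names here are assumed for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sat_name (customer_id INTEGER, load_dts TEXT, name TEXT,
                           PRIMARY KEY (customer_id, load_dts));
    CREATE TABLE sat_addr (customer_id INTEGER, load_dts TEXT, city TEXT,
                           PRIMARY KEY (customer_id, load_dts));
""")
conn.executemany("INSERT INTO sat_name VALUES (?, ?, ?)",
                 [(1, "2002-01-01", "Jane Smith"), (1, "2002-06-01", "Jane Doe")])
conn.executemany("INSERT INTO sat_addr VALUES (?, ?, ?)",
                 [(1, "2002-03-01", "Denver")])

def point_in_time(conn, customer_id, as_of):
    """For each satellite, the load date in effect at the snapshot time."""
    pit = {"customer_id": customer_id, "snapshot_dts": as_of}
    for sat in ("sat_name", "sat_addr"):
        row = conn.execute(
            f"SELECT MAX(load_dts) FROM {sat}"
            " WHERE customer_id = ? AND load_dts <= ?",
            (customer_id, as_of)).fetchone()
        pit[sat + "_load_dts"] = row[0]
    return pit

pit = point_in_time(conn, 1, "2002-04-15")
print(pit)
```

Materializing rows like this into a physical point-in-time table is what spares queries the repeated correlated sub-queries when two or more satellites surround a hub.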
A note about date/time stamping: while date/time stamps are shown here in the load date fields, it is possible
to use a surrogate key (numeric) that points to a time table. This may narrow all of the tables and will
allow greater flexibility in resolving date issues. Also note that a date/time stamp is usually a load date/time,
showing when the information arrived in the warehouse; it is also possible to utilize source system
record creation dates, but only if they are available. The issue here is consistency. All the data in the
warehouse must be consistently stamped in order to achieve a system of record that is understandable to the
business.
Building a Data Vault
The Data Vault should be built as follows:
1. Model the Hubs. This requires an understanding of business keys and their usage across the
enterprise.
2. Model the Links. Forming the relationships between the keys – formulating an understanding of how
the business operates today in context to each business key.
3. Model the Satellites. Providing context to each of the business keys as well as the transactions (links)
that connect the hubs together. This begins to provide the complete picture of the business.
4. Model the Point-in-Time Tables. This is a satellite derivative, of which the structure and definition is
outside the scope of this document (due to space constraints).
There are methods for representing external sources such as flat files, Excel feeds and user-defined
tab-delimited files; due to time and space constraints, these items will not be discussed here. No matter what
type of source, all the structures and modeling techniques apply.
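The four modeling steps above can be sketched end to end as an ordered schema build. The customer/order example and all names are assumptions, and SQLite stands in for whatever engine is actually in use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
steps = [
    # 1. Hubs: unique lists of business keys.
    """CREATE TABLE hub_customer (customer_number TEXT PRIMARY KEY,
                                  load_dts TEXT, record_source TEXT)""",
    """CREATE TABLE hub_order    (order_number TEXT PRIMARY KEY,
                                  load_dts TEXT, record_source TEXT)""",
    # 2. Links: relationships between the business keys.
    """CREATE TABLE lnk_customer_order (
           customer_number TEXT REFERENCES hub_customer(customer_number),
           order_number    TEXT REFERENCES hub_order(order_number),
           load_dts TEXT, record_source TEXT,
           PRIMARY KEY (customer_number, order_number))""",
    # 3. Satellites: context for the keys and relationships.
    """CREATE TABLE sat_customer_name (
           customer_number TEXT, load_dts TEXT, name TEXT,
           PRIMARY KEY (customer_number, load_dts))""",
    # 4. Point-in-time tables would be derived last (structure elided,
    #    as in the article).
]
for ddl in steps:
    conn.execute(ddl)

tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
```

The order matters: links depend on the hubs they connect, and satellites depend on the hub or link keys they describe, so the build naturally follows steps one through four.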
Reference rules for Data Vaults:
• Hub keys cannot migrate into other hubs (no parent/child-like hubs). To model in this manner breaks
the flexibility and extensibility of the Data Vault modeling technique.
• Hubs must be connected through links.
• More than two hubs can be connected through links.
• Links can be connected to other links.
• Links must have at least two hubs associated with them in order to be instantiated.
• Surrogate keys may be utilized for hubs and links.
• Surrogate keys may not be utilized for satellites.
• Hub keys always migrate outward.
• Hub business keys never change, hub primary keys never change.
• Satellites may be connected to hubs or links.
• Satellites always contain either a load date/time stamp or a numeric reference to a standalone load
date/time stamp sequence table.
• Standalone tables such as calendars, time, code and description tables may be utilized.
• Links may have a surrogate key.
• If a hub has two or more satellites, a point-in-time table may be constructed for ease of joins.
• Satellites are always delta driven, duplicate rows should not appear.
• Data is separated into satellite structures based on: 1) type of information 2) rate of change.
These simple components (hub, link and satellite) combine to form a Data Vault. A Data Vault can be as small
as a single hub with one satellite or as large as the scope permits. The scope can always be modified at a later
date, and scalability is not an issue (nor is granularity of the information). A data modeler can convert small
components of their existing data warehouse model to a Data Vault architecture one piece at a time. This is
because the changes are isolated to the hub and satellites. The business (how functional areas of business
interact) is represented by the links. In this manner the links can be end dated, rebuilt, revised and so on.
Solving the Pain of Data Warehouse Architectures
3NF and star schema, when used for enterprise data warehousing, may cause pain to the business because they
were not built originally for this purpose. There are issues surrounding scalability, flexibility and granularity
of data, integration and volume. The volume of information that warehouses are required to store today is
exponentially increasing every year. CRM, SCM, ERP and all the other large systems are forcing volumes of
information to be fed to the warehouses. The current data models based on 3NF or star schema are proving
difficult to modify, maintain and query, let alone backup and restore.
In the example previously provided, if the scope is to warehouse vehicle data and the corresponding
attributes over time – that is a single Data Vault comprised of a single hub with a few satellites. A year later,
if the business wants to warehouse contracts with that vehicle, hubs and links can be added easily. No
worries about granularity. This type of model extends upward and outward (bottom-up implementation, top-
down architecture). The end result is always foundationally strong and can be delivered with an iterative
approach.
Another example is the power of the link entity. Suppose a company sells products today, has a product hub,
an invoice hub and a link between the two. Then the company decides to sell services. The data model can
establish a new services hub, end date the entire set of product links and start a new link between services
and invoices. No data is lost and all data going back over time is preserved – matching the business change.
This is only one of many different possibilities for handling this situation.
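The end-dating idea might be sketched as follows. One common way to do it (assumed here, since the article does not prescribe a physical form) is to hold the end date in a satellite on the link, so the link rows and their history are never destroyed; all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lnk_product_invoice (
        product_id INTEGER, invoice_id INTEGER, load_dts TEXT,
        PRIMARY KEY (product_id, invoice_id));
    -- Effectivity satellite on the link: end dating lives here, so the
    -- link rows themselves (and the history) survive the business change.
    CREATE TABLE sat_product_invoice_eff (
        product_id INTEGER, invoice_id INTEGER, load_dts TEXT, end_dts TEXT,
        PRIMARY KEY (product_id, invoice_id, load_dts));
""")
conn.execute("INSERT INTO lnk_product_invoice VALUES (1, 100, '2002-01-01')")
conn.execute("INSERT INTO sat_product_invoice_eff VALUES (1, 100, '2002-01-01', NULL)")

def end_date_all(conn, when):
    """Close every still-open product/invoice relationship as of `when`."""
    conn.execute(
        "UPDATE sat_product_invoice_eff SET end_dts = ? WHERE end_dts IS NULL",
        (when,))

# The company stops selling products and starts selling services.
end_date_all(conn, "2002-09-30")
open_links = conn.execute(
    "SELECT COUNT(*) FROM sat_product_invoice_eff WHERE end_dts IS NULL"
).fetchone()[0]
print(open_links)   # 0: relationships closed, history preserved
```

A new services hub and a services-to-invoices link would then be created alongside, exactly as the text describes, without touching the historical product data.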
Volume causes query issues, particularly with the structures of star schema but not so much with 3NF.
Volume is breaking queries that are after the information in conformed dimensions and conformed fact
tables. Partitioning is often required and the structures are continually reworked to provide additional
granularity to the business users. This promotes a management and maintenance nightmare. Reloading an
ever-changing star is difficult – let alone attempting to perform this with volume (upwards of 1 Terabyte for
instance). The Data Vault is rooted in the fundamentals of mathematics that are squarely behind the
normalized data model. Reduction of redundancy and accounting for rates of change among data sets
contribute to increased performance and easier maintenance. The Data Vault architecture is not limited to
fitting on a single platform. The architecture allows for a distributed yet interlinked set of information.
Data warehouses must frequently deal with the statement: what I (the user) will give you won’t ever come
from the source system. The users then proceed to provide a spreadsheet with their daily maintained
interpretation of the information. In other words: I (the customer) want to see all VIN numbers that start with X rolled up
under label BIG TRUCKS. What the Data Vault provides for this is called a user grouping set. It’s another
hub (label Big Trucks) with a satellite describing which VIN numbers roll under this label and a link to the
VIN numbers themselves. In this manner, the original data from the source system are preserved while the
query tools can view the information in a manner appropriate to the users’ needs. When all is said and done a
data warehouse is successful if it meets the users’ needs.
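The user grouping set described above (a label hub plus a link to the VINs) might be sketched like this, with assumed names and SQLite standing in for the real engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_vehicle (vin TEXT PRIMARY KEY);
    CREATE TABLE hub_group (label TEXT PRIMARY KEY);   -- user-defined label hub
    CREATE TABLE lnk_group_vehicle (
        label TEXT REFERENCES hub_group(label),
        vin   TEXT REFERENCES hub_vehicle(vin),
        PRIMARY KEY (label, vin));
""")
conn.executemany("INSERT INTO hub_vehicle VALUES (?)",
                 [("X1001",), ("X1002",), ("Y2001",)])
conn.execute("INSERT INTO hub_group VALUES ('BIG TRUCKS')")

# The user's rule: every VIN starting with X rolls up under BIG TRUCKS.
# The source data is untouched; only the grouping link is added.
x_vins = [v for (v,) in conn.execute(
    "SELECT vin FROM hub_vehicle WHERE vin LIKE 'X%'")]
conn.executemany("INSERT INTO lnk_group_vehicle VALUES ('BIG TRUCKS', ?)",
                 [(v,) for v in x_vins])

vins = [v for (v,) in conn.execute(
    "SELECT vin FROM lnk_group_vehicle WHERE label = 'BIG TRUCKS' ORDER BY vin")]
print(vins)   # ['X1001', 'X1002']
```

Query tools can now roll up by label while the original VINs from the source system remain exactly as they arrived.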
The Foundations of the Data Vault Architecture
The architecture is rooted in the mathematics of reduction of redundancy. The satellites are set up to store
only the deltas, or changes, to the information within. If a single satellite begins to grow too quickly, it is very
easy to create two new satellites and run a delta splitter process: a process that will split the information into
the two new satellites, each process running another delta process before inserting the new rows. This
process can keep the rates of duplication of columnar data down to a minimum. It equates to utilizing less
storage. Satellites by nature can be very long and, in most cases, are geared to be narrow (not many
columns). In comparison, Type 2 dimensions may replicate data across many columns, making copies of
information over and over again as well as generating new keys.
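The delta splitter idea can be sketched in miniature: a wide satellite's history is split into two narrow satellites, and each new satellite keeps only true deltas (rows whose values actually changed). The column groupings and names are assumptions for illustration:

```python
def split_with_deltas(history, fast_cols, slow_cols):
    """history: rows (dicts) for one hub key, ordered by load_dts.
    Returns two delta-filtered histories, one per column group."""
    def deltas(cols):
        out, prev = [], None
        for row in history:
            values = tuple(row[c] for c in cols)
            if values != prev:                  # keep only changed rows
                out.append({"load_dts": row["load_dts"],
                            **{c: row[c] for c in cols}})
                prev = values
        return out
    return deltas(fast_cols), deltas(slow_cols)

# A rental car: availability flips daily, color changes rarely.
history = [
    {"load_dts": "d1", "rented": "no",  "color": "blue"},
    {"load_dts": "d2", "rented": "yes", "color": "blue"},
    {"load_dts": "d3", "rented": "no",  "color": "blue"},
    {"load_dts": "d4", "rented": "yes", "color": "red"},
]
fast, slow = split_with_deltas(history, ["rented"], ["color"])
print(len(fast), len(slow))   # 4 rows of rental status, 2 rows of color
```

Separating the fast-moving columns from the slow-moving ones is what keeps the duplication of columnar data to a minimum; the Type 2 dimension, by contrast, would have replicated all columns on every change.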
The hubs store a single instance of the business key. The business keys most often have a very low
propensity to change. Because of this, surrogate keys mapped to business keys (if surrogates are
utilized) are a one-to-one mapping and never change. The primary key of the hub (regardless of the type –
business or surrogate) is the only component of information replicated across the satellites and links. Because
of this the satellites are always tied directly to the business key. In this manner satellites are relegated to
describing the business key at the most granular level available. This provides a basis for "context" about a
business key to be developed.
Another unique result of the Data Vault is the ability to represent relationships dynamically. Relationships
are founded in a link structure the first time the business keys are "associated" in incoming source data. This
relationship exists until it’s either end dated (in a satellite) or deleted from the data set completely. The fact
that this relationship is represented in this manner opens up new possibilities in the area of dynamic
relationship building. If new relationships between two hubs (or their context) are discovered as a result of
data mining, new links can be formed automatically.
Likewise, link structures and information can be end dated or deleted at the time when they are no longer
relevant. For example: a company is selling products today and has a link table between products and
invoices. Tomorrow, they begin selling services. It may be as simple as constructing a service hub and a link
between invoices and services – then end dating all the relationships between the products and invoices. In
this example, the process of changing the data model can begin to be programmatically explored, which, if
automated, would dynamically change and adapt the structure of the data warehouse to meet the needs of the
business.
Rates of change and reduction of redundancy along with the flexibility of potentially unlimited dynamic
relationship alteration form a powerful foundation. These items open doors in the application of the Data
Vault structures to many different purposes.
Possible Applications of the Data Vault
As a result of these foundations, many different applications of the Data Vault may be considered. A few of
these are already in the throes of development. A small list of possibilities follows:
• Dynamic Data Warehousing – Based on dynamic automated changes made to both process and
structure within the warehouse.
• Exploration Warehousing – Allowing users to play with the structures of the data warehouse without
losing the content.
• In-Database Data Mining – Allowing the data mining tools to make use of the historical data, and to
better fit the form (structure) with the function of data mining/artificial intelligence.
• Rapid Linking of External Information – An ability to rapidly link and adapt structures to bring in
external information and make sense of it within the data warehouse without destroying existing
content.
The business of data warehousing is evolving – it must move in order to survive. The architecture and
foundations behind what data warehousing means will continue to change. The Data Vault overcomes most
of the problems and limitations of the past and stands ready to meet the challenges of the future.
Data Vault: The Next Evolution in Data Modeling, Part 2
By Dan Linstedt
The purpose of this two-part series is to present and discuss a patent-pending technique called a Data Vault –
the next evolution in data modeling for enterprise data warehousing. The audience of this paper should be the
data modelers who wish to construct a Data Vault data model. This article focuses on a specific example:
The Microsoft SQLServer 2000 Northwind Database. It is suggested that for the purposes of this discussion,
the reader obtains at a minimum, a trial copy of the SQLServer 2000 database engine. Please read Part 1,
which defines the Data Vault architecture. This will provide the context on what the data model is and how it
fits into business.
Let’s consider this for a moment: suppose it is possible to reverse engineer a data model into a warehouse.
What would that mean for a data warehousing project? Suppose it could be done in an automated fashion,
would that help or hurt? What if the only consideration necessary to make is how to integrate different
aspects of the generated data models? These and many more questions come to mind when beginning to
consider the automation of data modeling for data warehousing, particularly when the consideration involves reverse engineering.
For our purposes, having this functionality to produce a baseline Data Vault would be of tremendous help.
The Northwind data model was converted both by hand and in an automated fashion. When the two
data models were compared they only had minor differences. Further examination showed that hand
conversion opened up the possibilities for errors in link tables where the automated converter kept the links
clean. Some of the most important items to mechanizing the process are naming conventions, abbreviation
conventions and specification of primary/foreign keys.
What’s important here is that this is a baby step into the application of "dynamic data warehousing" or
dynamic model changes (please see my other article: Bleeding Edge Data Warehousing – due out in the
Journal of Data Warehousing Fall 2002). It also provided a data model in 10 minutes (for this particular
example), when it took roughly two hours to convert it by hand. It then took an additional 20 minutes to
adjust the model slightly and implement it. Keep in mind, this is a small data model and all that is proposed
is auto-vaulting of one OLTP data model at a time. The automated process isn’t smart enough yet to integrate
the resulting Data Vault data models.
The DDL is available on a Web link: http://www.coreintegration.com. Sign in to our free online community,
the Inner Core, and click on Downloads, select Data Warehousing, then find the zip file titled:
DataVault2DDL.zip (for this series). The DDL and the views are built for Microsoft SQLServer 2000. Feel
free to convert them to a database of your choice. The automated mechanism is not available today – it’s still
in an experimental phase. The DDL contains the tables for a Data Vault and the views to populate the
structure both initially and with changes.
Please keep in mind: this is not a "perfect Vault" and has not been conditioned to be the same quality of data
model as delivered to the customer. This is meant as an example only, for trial purposes. Feel free to contact
me directly with questions or comments.
Examining an OLTP 3NF Model for Conversion
Some OLTP data models in 3NF are easier to convert than others. However, there are some distinctive
properties which make the conversion process easy. Here are a few items to look for:
1. How well does the data model adhere to standard naming conventions? This will have an effect on
integration of fields. If the fields (attributes) are named the same across the model, then the resulting
Data Vault will be easier to build, and it will be easier to identify which components have been integrated.
2. How many independent tables are in the data model? Independent tables usually don’t integrate in a
Data Vault very well. It is a stretch to integrate the table through field name matching. Normally
these independent tables are copied across into a Data Vault as standalone tables – until integration
points can be found.
3. Have primary and foreign key relationships been defined? If referential integrity has been turned off
in the data model, it will be exceedingly difficult to create a Data Vault model. It is nearly impossible
to convert automatically; however, through some hard work and rolling up of the sleeves (digging
into business requirements) it can be done effectively.
4. Does the model utilize surrogate keys instead of natural keys? The converted data model favors
natural keys over surrogate keys. The models that are converted by hand require that the data
modeler understands the business well enough to identify the business keys (natural keys) and their
mapping to the surrogate keys.
5. Does the model match the business requirements for the data warehouse? If the requirements are to
integrate or consolidate data across the OLTP system (such as a single customer view or a single
address view) then the process of converting to a Data Vault may be a little more difficult. The
process may require cross-mapping data elements for integration purposes.
6. Can the information be separated by class or type of data? In other words, can all the addresses be
put in a single table, all the parts in another table, all the employees, etc? Separating the classes of
data helps with the integration effort. This is usually a manual cross-mapping and regrouping of attributes.
7. How quickly do certain attributes change? A Data Vault likes to separate data by rates of change. It
is easier to model a Data Vault if an understanding of the rates of change of the underlying
information is known.
These are just a set of suggested items to consider before converting the data model. They are by no means a
complete list. First and foremost, the data warehouse data model should always follow the business
requirements, regardless of how the base-line or initial model is generated. It is suggested that a scorecard
approach be developed. Over time, these items can be placed on a scale of difficulty; when that happens, it
will provide a good guideline as to the "convertibility" of a particular data model. Future articles in this series will cover
migrating conformed data marts and other types of adapted 3NF EDW to a Data Vault schema.
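The scorecard idea suggested above can be sketched in code. The criteria names and weights below are illustrative assumptions layered on the seven items, not part of any Data Vault specification; the point is only that a weighted checklist yields a repeatable "convertibility" number.

```python
# Hypothetical "convertibility scorecard" built from the seven checklist
# items above. Criterion names and weights are illustrative assumptions.
CRITERIA = {
    "consistent_naming": 2,       # item 1
    "few_independent_tables": 1,  # item 2
    "pk_fk_defined": 3,           # item 3
    "surrogate_keys_mapped": 2,   # item 4
    "matches_requirements": 2,    # item 5
    "data_classifiable": 1,       # item 6
    "rates_of_change_known": 1,   # item 7
}

def convertibility_score(assessment):
    """Return a 0-100 score from a dict of criterion -> True/False."""
    total = sum(CRITERIA.values())
    earned = sum(w for name, w in CRITERIA.items() if assessment.get(name))
    return round(100 * earned / total)

# A Northwind-like model: good keys and naming, unknown rates of change.
northwind = {
    "consistent_naming": True,
    "few_independent_tables": True,
    "pk_fk_defined": True,
    "surrogate_keys_mapped": True,
    "matches_requirements": True,
    "data_classifiable": True,
    "rates_of_change_known": False,
}
print(convertibility_score(northwind))
```

A model missing referential integrity (item 3) would lose the heaviest weight, matching the observation above that undefined keys make conversion the hardest.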
The Northwind Database
Northwind is built by Microsoft and ships with every installation of Microsoft SQLServer 2000. It is freely
accessible with sample data. The data model is shown in Figure 1.
In this model, the first thing to notice is the use of non-standard data types: bit, ntext, image, money. These
don’t port very well to other relational databases. This is important to resolve because most of the data
warehouses are not built on the same database engine as their OLTP counterparts. In this case a Data Vault
will be built on the same RDBMS engine. Another item that pops out of the data model is the recursive
relationship. Immediately, this should signal a necessary change to the data model.
The naming conventions appear consistent across the model. ID is used synonymously with primary keys,
primary and foreign keys are defined, there are no independent tables and the model does appear to use some
surrogate and some natural keys. For the sake of discussion, the business requirements are to house all of the
data in the warehouse and store only incremental changes to the data over time.
The attributes could be classed out (normalized) further if desired, items such as address, city, region and
postal code can all be grouped. Do certain attributes change faster than others? From looking at the model,
the two tables with the most changes might be orders and order details. There really isn’t a method that will
help the discovery of rapidly changing elements in this model. Normally rapid changing elements are either
indicated by business users or provided in audit trails, usage logs or through time stamps on the data itself. In
this case, none of these are present.
Figure 1: The Northwind Physical 3NF Data Model
The Process of Modeling a Data Vault
In order to keep the design simple, yet elegant, there are a minimum number of components, specifically the
hub, link and satellite entities, which require only traditional data modeling skills. These were defined in Part 1. Please refer to
the first article for definitions and table structure setup. This section will discuss the process of converting
the above data model to an effective Data Vault. The steps for a single model conversion without integration
are as follows:
1. Identify the business keys and the surrogate key groupings, model the hubs.
2. Identify the relationships between the tables that must be supported, model the links.
3. Identify the descriptive information, model the satellites.
4. Regroup the satellites (normalize) by rates of change or types of information.
To address more than one model, start with the business identified "master system." Build the first data model
and then incrementally map other data models and data elements into the single unified view of information.
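The four conversion steps above can be sketched as a classification pass over key metadata. The metadata format and the hub/link rules here are simplified assumptions for illustration only; real conversion still requires the modeler to identify business keys against the requirements.

```python
# A rough sketch of steps 1 and 2 above: classify source tables as hub or
# link candidates from key metadata. The metadata format and the rules are
# simplified assumptions; satellites (steps 3 and 4) would follow from the
# remaining descriptive attributes.

def classify(tables):
    """tables: name -> {"business_key": str or None, "fks": [table names]}."""
    hubs, links = [], []
    for name, meta in tables.items():
        if meta["business_key"]:
            hubs.append("HUB_" + name)   # step 1: tables with business keys
        else:
            links.append("LNK_" + name)  # step 2: keyless / many-to-many tables
    return hubs, links

northwind = {
    "Categories":   {"business_key": "CategoryName", "fks": []},
    "Products":     {"business_key": "ProductName",  "fks": ["Categories", "Suppliers"]},
    "OrderDetails": {"business_key": None,           "fks": ["Orders", "Products"]},
}
hubs, links = classify(northwind)
print(hubs)
print(links)
```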
There are three styles of load dates in the EDW Data Vault architecture, and before modeling can begin it is
wise to choose a style that suits your needs. The styles are as follows:
1. Standard Load Date as indicated by Part 1 and 2 of this article. This is easy to load, difficult to query.
For more than two satellite tables off a hub it may require an additional "picture table" or point-in-time
satellite to house the delta changes for equi-joins.
2. Load Date data type altered to be an integer reference to a Load Table where the date is stored.
The integer reference is a stand-alone foreign key to a load table and can be used if date logic is not
desired. Be aware that this can cause difficulties in reloading and resequencing the keys in the
warehouse. This is not a recommended practice/style.
3. Load End Date is added to all the satellites. Rows in satellites are end dated as new rows are inserted.
This can help the query perspective and at the same time can make loading slightly more complex.
Using this style, it may not be necessary to construct a picture table (point-in-time satellite).
Select the style that best suits the business needs and implement it across the model. Part of the Data Vault
modeling success is consistency. Stay consistent with the style that’s chosen and the model will be solid from
a maintenance perspective.
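As an illustration of style 3, the end-dating behavior can be sketched in portable SQL, shown here through Python’s sqlite3 since the downloadable DDL targets SQLServer 2000. Table and column names are illustrative, not the delivered model.

```python
import sqlite3

# Style 3 sketch: when a new satellite row arrives, the previous open row
# for the same hub key is end dated before the new picture is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE SAT_Category (
    CategoryID INTEGER, LOAD_DATE TEXT, LOAD_END_DATE TEXT, CategoryName TEXT,
    PRIMARY KEY (CategoryID, LOAD_DATE))""")

def load_satellite_row(key, load_date, name):
    # End date the current (open) row, then insert the new picture.
    conn.execute("""UPDATE SAT_Category SET LOAD_END_DATE = ?
                    WHERE CategoryID = ? AND LOAD_END_DATE IS NULL""",
                 (load_date, key))
    conn.execute("INSERT INTO SAT_Category VALUES (?, ?, NULL, ?)",
                 (key, load_date, name))

load_satellite_row(1, "2002-01-01", "Beverages")
load_satellite_row(1, "2002-06-01", "Drinks")
rows = conn.execute("""SELECT LOAD_DATE, LOAD_END_DATE FROM SAT_Category
                       ORDER BY LOAD_DATE""").fetchall()
print(rows)
```

The open row (LOAD_END_DATE null) is always the current picture, which is what makes querying easier than style 1 at the cost of the extra update during loading.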
Since the hubs are a list of business keys, it is important to keep them together with any surrogate keys (if
surrogates are available). Upon examination of the model we find the following business key/surrogate key
groupings (the examination included unique indexes and a data query):
• Categories: CategoryName is the business key, CategoryID is the surrogate key. This will constitute
a HUB_Category table.
• Products: ProductName is the business key, ProductID is the surrogate key. This will constitute a HUB_Product table.
• Suppliers: SupplierName is the business key, SupplierID is the surrogate key. This will constitute a HUB_Supplier table.
• Order Details: has no business key, and cannot "stand on its own." Therefore, it is NOT a hub table.
• Orders: Appears to have a surrogate key – which may or may not constitute a business key (depends
on the business requirements). Upon further investigation we find many foreign keys. The table
appears to be transactional in nature which makes it a good candidate for a link rather than a hub table.
• Shippers: CompanyName is a business key, and ShipperID is the surrogate key. Shippers will
constitute a HUB_Shippers Table. If the business requirements state that an integration of
"companies" is required, then the CompanyName field in Shippers can be utilized. However, if the
business requirements state that shippers must be kept separate, then CompanyName is not
descriptive enough and should be changed to ShipperName in order to keep with the current field naming conventions.
• Customers: CompanyName is the business key, and CustomerID is the surrogate key. Customers
will constitute a HUB_Customers table. Again, if integration is desired, then maybe an entity called:
HUB_Company would be constructed (to integrate Customers and Shippers).
• CustomerCustomerDemo: Has no real business key and cannot stand on its own; therefore it will be
a link table.
• CustomerDemographics: Upon first glance, CustomerDesc appears to be the business key with
CustomerTypeID being the surrogate key; however, this could also be constructed as a satellite of
Customer. Remember that the warehouse is meant to capture the source system data, not enforce the
rules of capture. For the purposes of this discussion, HUB_CustomerDemographics will be constituted.
• Employees: EmployeeName appears to be the best business key, with EmployeeID being the
surrogate key. This will constitute a HUB_Employee table.
• EmployeeTerritories: There appears to be no real business key here; this will not constitute a HUB
table and will most likely become a link table.
• Territories: TerritoryDescription appears to be the business key, with TerritoryID being the surrogate
key. This will constitute a HUB_Territories table.
• Region: RegionDescription is clearly the business key, RegionID is the surrogate key. This table will
constitute a HUB_Region table.
Once the analysis has been done for each of the table structures, we can assemble the list of hub tables that
will be built: Hub_Category, Hub_Product, Hub_Supplier, Hub_Shippers, Hub_Customer,
Hub_CustomerDemographics, Hub_Employee, Hub_Territories and Hub_Region. There are a couple of questionable items
which depending on the business rules may have their structure integrated. Remember that the hub structures
are all very similar. An example of the Hub_Category is given in Figure 2.
Figure 2: Example of Hub_Category
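As a sketch of what the Hub_Category structure might look like, assuming the hub layout defined in Part 1 (business key, surrogate key, load date and record source), the following uses portable SQL through Python’s sqlite3; the actual DDL in the download targets SQLServer 2000, and the column names here are illustrative.

```python
import sqlite3

# Hypothetical Hub_Category: a list of business keys, their surrogates,
# and the audit columns from Part 1. Hubs carry no descriptive data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE HUB_Category (
    CategoryID    INTEGER PRIMARY KEY,   -- surrogate key from the source
    CategoryName  TEXT NOT NULL UNIQUE,  -- the business key
    LOAD_DATE     TEXT NOT NULL,         -- first time the warehouse saw this key
    RECORD_SOURCE TEXT NOT NULL)""")
conn.execute("INSERT INTO HUB_Category VALUES (1, 'Beverages', '2002-01-01', 'NORTHWIND')")
print(conn.execute("SELECT CategoryName FROM HUB_Category").fetchall())
```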
Now that we have the hub structures, we can move on to the links. The function of the hubs is to integrate
and centralize the business around the business keys.
The links represent the business processes, the glue that ties the business keys together. They describe the
interactions and relationships between the keys. It is important to realize that the business keys and the
relationships that they contain are the most important elements in the warehouse. Without this information,
the data is difficult to relate. Typically transactions and many-to-many tables constitute good link tables.
Along with that, any table that doesn’t have a respective business key becomes a good link entity. Tables
with a single attribute primary key usually make good hub tables; however, the requirement is still for a
business key. In the case of Orders, a business key does not exist. The link tables of our model are as follows:
• Order Details: Many-to-many table, excellent link table. LNK_OrderDetails will be constituted.
• Orders: Many to many, parent transaction of Order Details, excellent link table. LNK_Orders will be
constituted. However, please note: It may or may not be appropriate to constitute hub_Orders as a
hub table, depending on the business and its desire to track Order ID. In this case, we will keep it
as a link table.
• CustomerCustomerDemo: Many-to-many table, excellent link table. LNK_CustomerCustomerDemo
will be constituted.
• EmployeeTerritories: Many-to-many table, excellent link table. LNK_EmployeeTerritories will be constituted.
Did we get all the linkages? No. Look again. There are some parent/child foreign key relationships in tables
that are slated to become hubs. Hubs don’t carry parent/child relationships or resolve granularity issues.
Examining the Products table, we see both a CategoryID and a SupplierID. This will constitute a
LNK_Product Table, including the ProductID, SupplierID and CategoryID. In a true data warehouse we
would construct a surrogate key for this link table – however in this case the data model states that ProductID
is sufficient to represent the supplier and category (as indicated by OrderDetails). No surrogate key is necessary.
In cases of integration (across other sources), it may be necessary to put the surrogate key into multiple link
tables. Are there other parent child relationships that need a linkage? Yes, Employees has a recursive
relationship. To draw this out, we will construct a LNK_EMPLOYEE table, so that the "reportsto"
relationship can be handled through a link table. There are no more relationships that need to be resolved.
Now we can move on to satellite entities. An example of a link table is shown in Figure 3.
Figure 3: Link Table
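A sketch of the link structures discussed above follows, again in portable SQL through Python’s sqlite3 and assuming the structures from Part 1. It shows how the recursive "reportsto" relationship is drawn out into LNK_Employee with two keys into the same hub; names and columns are illustrative.

```python
import sqlite3

# Hypothetical LNK_Employee: the recursive reports-to relationship becomes
# a link with two foreign keys into HUB_Employee.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB_Employee (EmployeeID INTEGER PRIMARY KEY,
                           EmployeeName TEXT, LOAD_DATE TEXT);
CREATE TABLE LNK_Employee (
    EmployeeID INTEGER REFERENCES HUB_Employee(EmployeeID),
    ReportsTo  INTEGER REFERENCES HUB_Employee(EmployeeID),
    LOAD_DATE  TEXT,
    PRIMARY KEY (EmployeeID, ReportsTo));

INSERT INTO HUB_Employee VALUES (1, 'Fuller',  '2002-01-01');
INSERT INTO HUB_Employee VALUES (2, 'Davolio', '2002-01-01');
INSERT INTO LNK_Employee VALUES (2, 1, '2002-01-01');  -- Davolio reports to Fuller
""")
print(conn.execute("SELECT EmployeeID, ReportsTo FROM LNK_Employee").fetchall())
```

Because the relationship lives in the link rather than in the hub, a change in reporting lines is a new link row, not a structural change to the model.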
The rest of the fields are subject to change over time – therefore, they will be placed into satellites. The
following tables will be created as satellite structures: Categories, Products, Suppliers, Order Details, Orders,
Customers, Shippers and Employees. The satellites contain only non-foreign key attributes. The primary key
of the satellite is the primary key of the hub with a LOAD_DATE incorporated. It is a composite key as
described in Part 1. In the interest of time and space only one example of a satellite is listed in Figure 4.
Figure 4: Satellite Example
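A sketch of a satellite with the composite key described above (the hub’s key plus LOAD_DATE) might look as follows, in portable SQL through Python’s sqlite3. The descriptive columns are illustrative samples from a Customers-style table.

```python
import sqlite3

# Hypothetical SAT_Customer: non-foreign-key descriptive attributes, keyed
# by the hub key plus LOAD_DATE so each change becomes a new row.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE SAT_Customer (
    CustomerID  TEXT,
    LOAD_DATE   TEXT,
    CompanyName TEXT, ContactName TEXT, City TEXT,  -- descriptive fields only
    PRIMARY KEY (CustomerID, LOAD_DATE))""")
conn.execute("""INSERT INTO SAT_Customer VALUES
    ('ALFKI', '2002-01-01', 'Alfreds Futterkiste', 'Maria Anders', 'Berlin')""")
conn.execute("""INSERT INTO SAT_Customer VALUES
    ('ALFKI', '2002-06-01', 'Alfreds Futterkiste', 'Maria Anders', 'Hamburg')""")
print(conn.execute("SELECT COUNT(*) FROM SAT_Customer").fetchone()[0])
```

The two rows for the same customer illustrate the audit trail: the composite key keeps every historical picture without overwriting anything.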
The physical data model now appears as shown in Figure 5.
Figure 5: Physical Northwind Data Vault Model
If this is difficult to read, the full image is available on a PDF (in the ZIP file on the Inner Core Downloads)
at: http://www.coreintegration.com (sign up for the Inner Core – it’s free, then go to the downloads section).
This is the entire data model with all the hubs in light gray/blue, the links in red and the satellites in white.
This is style 1, with just a standard load date being utilized. In the interest of space the other styles will be
represented in a future article.
Populating a Data Vault
If the Auto Vault generation process is used, the views will be generated to populate the data structures, right
along with the generation of the structures themselves. In this case, the views have been generated. A sample
is provided of one of each of the hubs, links and satellites.
The hubs are inserts only. They record the business keys the first time the data warehouse sees them. They
do not record subsequent occurrences. Only the new keys are inserted into the hubs. The links are the same
way; they are inserts only – for only the rows that do not already exist in the links. The satellites are also
delta driven. The satellites insert any row that has changed from the source system perspective, providing an
audit trail of changes.
Another purpose of a Data Vault structure is to house 100 percent of the incoming data, 100 percent of the
time. It will be up to the reporting environments, and the data marts, to determine what data is "in error"
according to business rules. A Data Vault makes it easy to construct repeatable, consistent processes,
including load processes. The architecture provides another baby step in the direction of allowing dynamic data warehousing.
To load the hubs: select a distinct list of business keys with their surrogate keys, where the keys do not
already exist in the hub. See Figure 6.
Figure 6: Load Hubs
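The hub-loading rule just described (insert only the business keys not yet present) can be sketched in portable SQL, shown here through Python’s sqlite3; the actual views in the download are written for SQLServer 2000, and the table names here are illustrative.

```python
import sqlite3

# Hub load sketch: distinct business keys from the source that do not
# already exist in the hub are inserted; existing keys are left alone.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Categories (CategoryID INTEGER, CategoryName TEXT);
CREATE TABLE HUB_Category (CategoryID INTEGER, CategoryName TEXT, LOAD_DATE TEXT);
INSERT INTO Categories VALUES (1, 'Beverages'), (2, 'Condiments'), (1, 'Beverages');
INSERT INTO HUB_Category VALUES (1, 'Beverages', '2002-01-01');
""")
conn.execute("""
INSERT INTO HUB_Category (CategoryID, CategoryName, LOAD_DATE)
SELECT DISTINCT s.CategoryID, s.CategoryName, '2002-06-01'
FROM Categories s
WHERE NOT EXISTS (SELECT 1 FROM HUB_Category h
                  WHERE h.CategoryName = s.CategoryName)
""")
print(conn.execute("SELECT CategoryName FROM HUB_Category ORDER BY CategoryID").fetchall())
```

Only Condiments is inserted on the second load; Beverages was already recorded the first time the warehouse saw it.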
To load the links: select a distinct list of composite keys with their surrogates (if provided), where the data
does not already exist in the link (see Figure 7).
Figure 7: Load Links
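Link loading follows the same insert-only pattern over the composite key, sketched here in portable SQL through Python’s sqlite3 with illustrative names.

```python
import sqlite3

# Link load sketch: insert only the composite key combinations that do not
# already exist in the link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
CREATE TABLE LNK_OrderDetails (OrderID INTEGER, ProductID INTEGER, LOAD_DATE TEXT);
INSERT INTO OrderDetails VALUES (10248, 11, 12), (10248, 42, 10);
INSERT INTO LNK_OrderDetails VALUES (10248, 11, '2002-01-01');
""")
conn.execute("""
INSERT INTO LNK_OrderDetails
SELECT DISTINCT s.OrderID, s.ProductID, '2002-06-01'
FROM OrderDetails s
WHERE NOT EXISTS (SELECT 1 FROM LNK_OrderDetails l
                  WHERE l.OrderID = s.OrderID AND l.ProductID = s.ProductID)
""")
print(conn.execute("SELECT COUNT(*) FROM LNK_OrderDetails").fetchone()[0])
```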
To load the satellites: select a set of records, match or join to the business key (or by composite key if
possible), where the columns between the source and target have at least one change. Match only to the
"latest" picture of the satellite row in the satellite table for comparison reasons (see figure 8).
Figure 8: Load Satellites
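The satellite rule above (insert a new row only when at least one column differs from the latest stored picture) can be sketched as follows in portable SQL through Python’s sqlite3. A single descriptive column stands in for the many-column comparison the real views perform.

```python
import sqlite3

# Satellite delta-load sketch: compare each source row against only the
# latest picture in the satellite; insert when a column has changed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID TEXT, City TEXT);
CREATE TABLE SAT_Customer (CustomerID TEXT, LOAD_DATE TEXT, City TEXT,
                           PRIMARY KEY (CustomerID, LOAD_DATE));
INSERT INTO Customers VALUES ('ALFKI', 'Hamburg'), ('ANATR', 'Mexico D.F.');
INSERT INTO SAT_Customer VALUES ('ALFKI', '2002-01-01', 'Berlin');
INSERT INTO SAT_Customer VALUES ('ANATR', '2002-01-01', 'Mexico D.F.');
""")
conn.execute("""
INSERT INTO SAT_Customer
SELECT s.CustomerID, '2002-06-01', s.City
FROM Customers s
LEFT JOIN SAT_Customer t
  ON t.CustomerID = s.CustomerID
 AND t.LOAD_DATE = (SELECT MAX(LOAD_DATE) FROM SAT_Customer
                    WHERE CustomerID = s.CustomerID)
WHERE t.City IS NOT s.City  -- null-safe inequality in SQLite
""")
print(conn.execute("SELECT CustomerID, COUNT(*) FROM SAT_Customer GROUP BY CustomerID").fetchall())
```

ALFKI changed city, so it gains a second row; ANATR is unchanged and is not reloaded, which is exactly the audit-trail behavior described above.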
The view is built to handle null comparisons as well as chop the comparison on text and image components
to only 2,000 characters. The comparison is extremely fast and is a short-circuit Boolean evaluation. These
views are run as Insert Into … Select * From …. The satellite view is fast as long as partitioning is observed
along with the primary key.
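The two comparison tricks just mentioned can be sketched in portable SQL through Python’s sqlite3: a null-safe inequality, and chopping long text to a fixed prefix before comparing. SUBSTR stands in here for the SQLServer text handling, and the 2,000-character limit mirrors the figure quoted above.

```python
import sqlite3

# Null-safe comparison and chopped text comparison, as one-row expressions.
conn = sqlite3.connect(":memory:")
differs = conn.execute("""
SELECT
  CASE WHEN ? IS NOT ? THEN 1 ELSE 0 END,              -- null-safe: NULL vs 'x' differs
  CASE WHEN SUBSTR(?, 1, 2000) IS NOT SUBSTR(?, 1, 2000)
       THEN 1 ELSE 0 END                               -- compare only a 2,000-char prefix
""", (None, "x", "a" * 3000, "a" * 3000)).fetchone()
print(differs)
```

The first flag fires because an ordinary equality test against NULL would silently miss the change; the second does not, because the two long values share the compared prefix.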
Views work well when the source and a Data Vault are in the same instance of the relational database engine.
If different instances are utilized, then there are two suggested solutions: 1) stage the source data into the
warehouse target so that the views can be used, or 2) utilize an ETL tool to make the transfer and comparison
of the information. However, staging the information in the warehouse and utilizing the views allows the
database engine to keep the data local, and in some cases take advantage of the highly parallelized operations
in the RDBMS engine (such as Teradata, for instance).
This article provides a look at implementing and building a Data Vault along with a sample Data Vault
structure that most everyone has access to. This simple example is meant to show that a Data Vault can be
built in an iterative fashion, and that it is not necessary to build the entire EDW in one sitting. It is also meant
to serve as an example for Part 1, showing that this modeling technique is effective and efficient. The next
series will dive into querying this style of data model and will discuss Style 3 – Load End Date of records vs.
the point-in-time satellite structure.
Dan Linstedt is chief technology officer for Core Integration Partners. You can view his online presentation
about the Data Vault at the online trade show on www.dataWarehouse.com/tradeshow/ until January 15,
2003. If you are interested in more information, please contact Linstedt at firstname.lastname@example.org.