Data Vault: The Next Evolution in Data Modeling, Part 1
By Dan Linstedt
Published 11/29/02

Editor’s note: Dan Linstedt has a presentation that further develops the concept of the Data Vault as the Next Evolution in Data Modeling in the "Building a DW Infrastructure to Support BI Initiatives" online trade show, now available at www.dataWarehouse.com/tradeshow/.

The purpose of this article is to present and discuss a patent-pending technique called the Data Vault – the next evolution in data modeling for enterprise data warehousing. This is a highly technical paper intended for data modelers, data architects and database administrators; it is not meant for business analysts, project managers or mainframe programmers. A base level of knowledge of common data modeling terms is assumed: table, relationship, parent, child, key (primary/foreign), dimension and fact.

For too long we have waited for data structures to catch up with artificial intelligence and data mining applications. Most data mining technology must import flat file information in order to join form with function. Unfortunately, volumes in data warehouses are growing rapidly, and exporting this information for data mining purposes is becoming increasingly difficult. It simply doesn’t make sense to have this discontinuity between form (structure), function (artificial intelligence) and execution (the act of data mining). Marrying form, function and execution holds tremendous power for the artificial intelligence (AI) and data mining communities. Having data structures that are mathematically sound increases the ability to bring these technologies back into the database. The Data Vault is based on mathematical principles that allow it to be extensible and capable of handling massive volumes of information. The architecture and structure are designed to handle dynamic changes to relationships between information.
A stretch of the imagination might be to one day encapsulate the data with the functions of data mining, hopefully moving toward a "self-aware" independent piece of information – but that’s just a dream for now. It is possible to form, drop and evaluate relationships between data sets dynamically, changing the landscape of what is possible with a data model and essentially bringing the data model into a dynamic state of flux (through the use of data mining/artificial intelligence). By implementing reference architectures on top of a Data Vault structure, the functions that access the content may begin to execute in parallel and in an automated, dynamic fashion. The Data Vault solves some of the structural and storage problems of enterprise data warehousing from a normalized, best-of-breed perspective. The concepts provide a whole host of opportunities for applying this unique technology.

Defining a Data Vault

Definition: The Data Vault is a detail-oriented, historical-tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between third normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model architected specifically to meet the needs of enterprise data warehouses.

The Data Vault is architected to meet the needs of the data warehouse, not to be confused with a data mart. It can double as an operational data store (ODS) if the correct hardware and database engine are in place to support it. The Data Vault can handle massive sets of granular data in a smaller, more normalized physical space than both 3NF and star schema. The Data Vault is foundationally strong. It is based on the
mathematical principles that support normalized data models. Inside the Data Vault model are familiar structures that match traditional definitions of star schema and 3NF, including dimensions, many-to-many linkages and standard table structures. The differences lie in relationship representations, field structuring and granular, time-based data storage. The modeling techniques built into the Data Vault have undergone years of design and testing across many different scenarios, giving them a solid foundational approach to data warehousing.

A Brief History of Data Modeling for Data Warehousing

3NF was originally developed in the early 1970s (Codd & Date) for online transaction processing (OLTP) systems. In the early 1980s, it was adapted to meet the growing needs of data warehouses: essentially, a date/time stamp was added to the primary keys in each of the table structures. (See Figure 1.) In the mid to late 1980s, star schema data modeling was introduced and perfected. It was architected to solve subject-oriented problems including (but not limited to) aggregations, data model structural change, query performance, reusable or shared information, ease of use and the ability to support online analytical processing (OLAP). This single subject-centric architecture became known as a data mart. Soon thereafter it too was adapted to multi-subject data warehousing in an attempt to meet the growing needs of enterprise data warehousing. The term for this is conformed data marts.

Figure 1: Data Model Time Line

Performance and other weaknesses of 3NF and star schema (when used within an enterprise data warehouse) began to show in the 1990s as the volume of data increased. The Data Vault is architected to overcome these shortcomings while retaining the strengths of the 3NF and star schema architectures. Within the past year, this technique has been favorably received by industry experts.
The Data Vault is the next evolution in data modeling because it is architected specifically for enterprise data warehouses.

The Problems of Existing Data Warehouse Data Modeling Architectures

Each modeling technique has limitations when applied to enterprise data warehouse architecture, because each is an adaptation of a design rather than a design built specifically for the task. These limitations reduce usability and constantly contribute to the "holy wars" in the data warehousing world. The following paragraphs concern these architectures as applied to data warehouses, not in their respective original purposes.

3NF has the following issues to contend with: time-driven primary key issues causing parent-child complexities, cascading change impacts, difficulties in near real-time loading, troublesome query access, problematic drill-down analysis, top-down architecture and unavoidable top-down implementation. Figure 2 is an original 3NF model adapted to data warehousing architecture. One particularly thorny problem is evident when a date/time stamp is placed into the primary key of a parent table. This is necessary in order to represent changes to detail data over time. The problem is scalability and flexibility. If an additional parent table is added, the change is forced to cascade down through all subordinate table structures. Also, when a new row is inserted with an existing parent key (the only field to change is the date/time stamp), all child rows must be reassigned to the new parent key. This cascading effect has a tremendous impact on the processes and the data model – the larger
the model, the greater the impact. This makes it difficult (if not impossible) to extend and maintain an enterprise-wide data model. The architecture and design suffer as a result.

Figure 2: Date/Time Stamped 3NF

The conformed data mart also has trouble. It is a collection of fact tables that are linked together via primary/foreign keys – in other words, a linked set of related star schemas. The problems this creates are numerous: isolated subject-oriented information, possible data redundancy, inconsistent query structuring, aggravated scalability issues, difficulties with fact table linkages (incompatible grain), synchronization issues in near real-time loading, limited enterprise views and troublesome data mining. While the star schema is typically bottom-up architecture with bottom-up implementation, the conformed data mart should be top-down architecture with bottom-up implementation. However, informal polling has shown that bottom-up architecture and bottom-up implementation appear to be the standard.

One of the most difficult issues of a conformed data mart (or conformed fact tables) is getting the grain right. That means understanding the data as it is aggregated for each fact table and assuring that the aggregation will stay consistent for all time (during the life of the relationship) and that the structure of each fact table will not change (i.e., no new dimensions will be added to either fact table). This limits the design, scalability and flexibility of the data model. Another issue is the "helper table." This table is defined to be a dimension-to-dimension relationship link. Granularity is very important, as is the stability of the design of the dimension. This too limits the design, scalability and flexibility of the data model.

Figure 3: Conformed Data Mart

If the granularity of the Revenue Fact is altered, then it is no longer the same (duplicate) fact table. By adding a dimension to one of the fact tables, the granularity frequently changes.
It has also been suggested that fact tables can be linked together just because they carry the same dimension keys. This is only true if the facts are aggregated to the same granularity, which is an extremely difficult task to maintain as the system grows and matures.
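The double-counting risk of joining fact tables at different grains can be illustrated with a small, hypothetical example (the table and column names below are invented for illustration and do not come from the article):

```python
# Hypothetical illustration: joining two fact tables that share dimension
# keys but are aggregated at different grains silently inflates measures.

# revenue_fact is at (customer, month) grain.
revenue_fact = [
    {"customer_id": 1, "month": "2002-01", "revenue": 100.0},
    {"customer_id": 1, "month": "2002-02", "revenue": 200.0},
]

# shipment_fact is at (customer) grain -- one row per customer.
shipment_fact = [
    {"customer_id": 1, "shipments": 5},
]

# Naive join on the shared customer_id key.
joined = [
    {**r, **s}
    for r in revenue_fact
    for s in shipment_fact
    if r["customer_id"] == s["customer_id"]
]

# The shipment count is repeated once per revenue row, so summing it
# across the joined rows double-counts: 10 instead of the true 5.
total_shipments = sum(row["shipments"] for row in joined)
print(total_shipments)  # 10 -- inflated by the grain mismatch
```

Keeping both facts at the same grain (or aggregating before the join) avoids the inflation, which is exactly the maintenance burden the paragraph above describes.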
The Importance of Architecture and Design for Enterprise Data Warehousing

A data warehouse should be top-down architecture and bottom-up implementation. This allows the architecture to reach the maximum known knowledge boundaries while the implementation can be scope-controlled, which facilitates fast delivery times. The implementation should, therefore, be designed as a plug-and-play set of tables without becoming a stovepipe upon delivery. The design and architecture of a data warehouse must be flexible enough to grow and change with the business needs, because the needs of today are not necessarily the needs of tomorrow.

Our industry needs a formalized data modeling architecture and design that is capable of accurately representing data warehouses. The architecture must be a defined normalization for data warehousing, as distinct from the defined normalization for OLTP systems. For example, the defined normalization of OLTP is 1st, 2nd and 3rd normal form (and, of course, 4th, 5th and perhaps 6th normal form). Data warehousing today does not have such a structured or predefined normalization for data modeling. It is also apparent that a haphazard normalization effort is no longer sufficient for an enterprise data warehousing architecture. Inconsistencies in modeling techniques lead to maintenance-intensive implementations.

The Data Vault is a defined normalization of data modeling for data warehouses. Its strength lies in the structure and usage from which the model is built. It utilizes some of the following data modeling techniques: many-to-many relationships, referential integrity, minimally redundant data sets and business-function-keyed information hubs. These techniques make the Data Vault data model flexible, expandable and consistent.
The approach to building a Data Vault data model is iterative, which provides a platform for data architects and business users to construct enterprise data warehouses in a component-based fashion (see Bill Inmon’s article "Data Mart Does Not Equal Data Warehouse," DM Review, May 1998).

The Data Vault Components

In order to keep the design simple yet elegant, there are a minimum number of components, specifically the hub, link and satellite entities. The Data Vault design is focused around the functional areas of business, with the hub representing the primary key. The link entities provide transaction integration between the hubs. The satellite entities provide the context of the hub primary key. Each entity is designed to provide maximum flexibility and scalability while retaining most of the traditional skill sets of data modeling expertise.

Hub Entities

Hub entities, or hubs, are single tables carrying, at a minimum, a unique list of business keys. These are the keys that the business utilizes in everyday operations: for example, invoice number, employee number, customer number, part number and vehicle identification number (VIN). If the business were to lose the key, it would lose the reference to the context, or surrounding information. Other attributes in the hub include:

• Surrogate Key – Optional component, possibly a smart key or a sequential number.
• Load Date/Time Stamp – Recording when the key itself first arrived in the warehouse.
• Record Source – A recording of the source system, utilized for data traceability.

Figure 4 represents what a customer hub might look like. It is a visual representation of the structure within the database system. In this instance, the customer number is the primary business key, while the ID is a customer surrogate – assigned for join reasons and reduction of storage. For example, suppose the requirement is to capture customer number across the company.
Accounting may have a customer number (12345) represented in a numeric style, and contracts may have the same customer number prefixed with an alpha (AC12345). In this case, the representation of the customer number in the hub would be alphanumeric and set to the maximum length needed to hold all of the customer numbers from both functional areas of business. The hub would have two entries, 12345 and AC12345, each with its own record source – one from accounting and one from contracts. The obvious preference is to perform cleansing and
matching on these numbers to integrate them together. However, that topic is out of scope for this paper. The hub’s primary key always migrates outward from the hub. Once the business is correctly identified through keys (say customer and account), the link entities can be constructed.

Figure 4: Example Customer Hub

Link Entities

Link entities, or links, are a physical representation of a many-to-many 3NF relationship. The link represents the relationship or transaction between two or more business components (two or more business keys). It is instantiated (physically) in the logical model in order to add attributes and surround the transaction with context (this is discussed in the satellite entity description next). The link contains the following attributes:

• Surrogate Key – Optional component, possibly a smart key or a sequential number. Only utilized if more than two hubs connect through this link, or if the composite primary key might cause performance problems.
• Hub 1 Key to Hub N Key – Hub keys migrated into the link to represent the composite key or relationship between two hubs.
• Load Date/Time Stamp – Recording when the relationship/transaction was first created in the warehouse.
• Record Source – A recording of the source system, utilized for data traceability.

Figure 5: Example Link Table Structure

This is an adaptation of a many-to-many relationship in 3NF in order to solve the problems related to scalability and flexibility. This modeling technique is designed for data warehouses, not for OLTP systems. The application loading the warehouse must take responsibility for enforcing one-to-many relationships if that is the desired result. Please note that some of the foundational rules for data modeling with the Data Vault are listed at the end of this document. With just a series of hubs and links, the data model begins to describe the business flow. The next component is to understand the context around
when, why, what, where and who constructed both the transaction and the keys themselves. For example, it is not enough to know what a VIN is for a vehicle, or that there is a driver number five out there somewhere. The customer wants to know what the VIN represents (i.e., a blue Toyota pickup, 4WD, etc.), that driver number five represents the name Jane, and perhaps that Jane is the driver of this particular VIN.

Satellite Entities

Satellite entities, or satellites, are hub key context (descriptive) information. All of this information is subject to change over time; therefore, the structure must be capable of storing new or altered data at the granular level. The VIN should not change, but if a wrecking crew rebuilds the Toyota – chops the top and adds a roll bar – it may not be a pickup anymore. What if Jane sells the car to someone else, say driver number six? The satellite is comprised of the following attributes:

• Satellite Primary Key: Hub or Link Primary Key – Migrated into the satellite from the hub or link.
• Satellite Primary Key: Load Date/Time Stamp – Recording when the context information became available in the warehouse (a new row is always inserted).
• Satellite Optional Primary Key: Sequence Surrogate Number – Utilized for satellites that have multiple values (such as a billing and a home address) or line item numbers, used to keep the satellites subgrouped and in order.
• Record Source – A recording of the source system, utilized for data traceability.

Figure 6: Customer Name Satellite

In Figure 6, we are able to show the changes to customer name over time. The figure also depicts the different source systems from which the rows originated. This allows the warehouse to store the information at the most granular level, while maintaining an audit trail in the warehouse for traceability reasons.
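The three component types can be sketched as DDL. The following is a minimal illustration using SQLite via Python’s sqlite3 module; the table and column names (hub_customer, sat_customer_name, etc.) are invented for this sketch and are not prescribed by the article:

```python
import sqlite3

# Minimal sketch of the three Data Vault component types.
# Table/column names here are illustrative assumptions, not a standard.
con = sqlite3.connect(":memory:")
con.executescript("""
-- Hub: unique list of business keys plus load metadata.
CREATE TABLE hub_customer (
    customer_id   INTEGER PRIMARY KEY,   -- optional surrogate key
    customer_num  TEXT NOT NULL UNIQUE,  -- the business key
    load_dts      TEXT NOT NULL,         -- when the key first arrived
    record_source TEXT NOT NULL          -- source system, for traceability
);

CREATE TABLE hub_account (
    account_id    INTEGER PRIMARY KEY,
    account_num   TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: many-to-many relationship between two (or more) hubs.
CREATE TABLE lnk_customer_account (
    customer_id   INTEGER NOT NULL REFERENCES hub_customer(customer_id),
    account_id    INTEGER NOT NULL REFERENCES hub_account(account_id),
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_id, account_id)
);

-- Satellite: context for a hub key; load_dts is part of the primary key
-- so every change becomes a new, time-ordered row.
CREATE TABLE sat_customer_name (
    customer_id   INTEGER NOT NULL REFERENCES hub_customer(customer_id),
    load_dts      TEXT NOT NULL,
    customer_name TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_id, load_dts)
);
""")

# Two rows for the same customer record a name change over time.
con.execute("INSERT INTO hub_customer VALUES (1, '12345', '2002-01-01', 'ACCTG')")
con.execute("INSERT INTO sat_customer_name VALUES (1, '2002-01-01', 'Dan', 'ACCTG')")
con.execute("INSERT INTO sat_customer_name VALUES (1, '2002-06-01', 'Daniel', 'CRM')")
history = con.execute(
    "SELECT customer_name FROM sat_customer_name WHERE customer_id = 1 ORDER BY load_dts"
).fetchall()
print(history)  # [('Dan',), ('Daniel',)]
```

Note how the composite primary key on the satellite (hub key plus load date/time stamp) is what turns every attribute change into a new, time-ordered row rather than an update.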
Notice that the load_dts is part of the composite primary key; because of this, the data is ordered on a time basis and is referenced through the customer ID surrogate key. The satellite is most closely related to a Type 2 dimension as defined by Ralph Kimball. It stores deltas at a granular level; its function is to provide context around the hub key – for example, the fact that VIN 1234567 represents a blue Toyota truck today and a red Toyota truck tomorrow. Color may be a satellite for automobile. Its design relies on the mathematical principles surrounding reduction of data redundancy and rate of change. For instance, if the automobile is a rental, the dates of availability/rented might change daily, which is much faster than the rate of change for color, tires or owner. The issue that the satellite solves is the following: an automobile dimension may contain 160+ attributes; if the color or the tires change, then all 160+ attributes must be replicated into a new row (if utilizing a Type 2 dimension). Why replicate data when the rest of the attributes are changing at slower rates? If utilizing a Type 1 or Type 3 dimension, it is possible to lose partial or complete historical trails. In this case, the data modeler should construct at a minimum two satellites: dates of availability and maintenance/parts. If the customer who rents the auto the first day is Dan
and the second day is Jane, then it is the link’s responsibility to represent the relationship. The data modeler might attach one or more satellites to the link representing dates rented (from/to), condition of vehicle and comments made by the renter.

Figure 7: Sample Point-In-Time Table

The point-in-time table is a satellite derivative. It is built to assist queries in their effort to find information at specific points in time. It is the system of record for the historical pictures that are gathered by the different satellites. It is typically only built if there are two or more satellites surrounding a hub; with one satellite, a correlated sub-query will work just fine. This table can also shed light on rate-of-change comparisons and from-to date stamping across the satellites. In other words, some of the statistics produced by this table can provide insight into how often different information is changing.

A note about date/time stamping: while date/time stamps are shown here in the load date fields, it is possible to use a numeric surrogate key that points to a time table. This may reduce the width of all the tables and will allow greater flexibility in resolving date issues. Also note that a date/time stamp is usually a load date/time – showing when the information arrives in the warehouse – although it is also possible to utilize source system record creation dates, but only if they are available. The issue here is consistency. All the data in the warehouse must be consistently stamped in order to achieve a system of record that is understandable to the business community.

Building a Data Vault

The Data Vault should be built as follows:

1. Model the Hubs. This requires an understanding of business keys and their usage across the designated scope.
2. Model the Links. Forming the relationships between the keys – formulating an understanding of how the business operates today in the context of each business key.
3. Model the Satellites.
Providing context to each of the business keys as well as the transactions (links) that connect the hubs together. This begins to provide the complete picture of the business.
4. Model the Point-in-Time Tables. This is a satellite derivative, whose structure and definition are outside the scope of this document (due to space constraints).

There are methods for representing external sources such as flat files, Excel feeds and user-defined tab-delimited files; due to time and space constraints, these items will not be discussed here. No matter what type of source, all the structures and modeling techniques apply.

Reference rules for Data Vaults:
• Hub keys cannot migrate into other hubs (no parent/child-like hubs). To model in this manner breaks the flexibility and extensibility of the Data Vault modeling technique.
• Hubs must be connected through links.
• More than two hubs can be connected through links.
• Links can be connected to other links.
• Links must have at least two hubs associated with them in order to be instantiated.
• Surrogate keys may be utilized for hubs and links.
• Surrogate keys may not be utilized for satellites.
• Hub keys always migrate outward.
• Hub business keys never change; hub primary keys never change.
• Satellites may be connected to hubs or links.
• Satellites always contain either a load date/time stamp or a numeric reference to a standalone load date/time stamp sequence table.
• Standalone tables such as calendar, time, code and description tables may be utilized.
• Links may have a surrogate key.
• If a hub has two or more satellites, a point-in-time table may be constructed for ease of joins.
• Satellites are always delta driven; duplicate rows should not appear.
• Data is separated into satellite structures based on: 1) type of information and 2) rate of change.

These simple components – hub, link and satellite – combine to form a Data Vault. A Data Vault can be as small as a single hub with one satellite or as large as the scope permits. The scope can always be modified at a later date, and scalability is not an issue (nor is granularity of the information). A data modeler can convert small components of an existing data warehouse model to a Data Vault architecture one piece at a time. This is because the changes are isolated to the hubs and satellites. The business (how functional areas of business interact) is represented by the links. In this manner the links can be end dated, rebuilt, revised and so on.
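The build order above (hubs, then links, then satellites) also suggests a load order. A minimal sketch of a delta-driven load, using in-memory dictionaries and invented names rather than any particular database or ETL tool:

```python
# Sketch of a delta-driven Data Vault load, following the build order:
# hubs first, then links, then satellites. All names are illustrative.

hub_customer = {}      # business_key -> load_dts of first arrival
lnk_cust_invoice = {}  # (customer_key, invoice_key) -> load_dts
sat_customer = {}      # business_key -> list of (load_dts, attrs) deltas

def load_hub(hub, business_key, load_dts):
    # A hub row is inserted only the first time the business key is seen.
    hub.setdefault(business_key, load_dts)

def load_link(link, key_pair, load_dts):
    # A link row is inserted the first time the keys are associated.
    link.setdefault(key_pair, load_dts)

def load_satellite(sat, business_key, attrs, load_dts):
    # Satellites are delta driven: insert a new row only when the
    # descriptive attributes actually changed.
    history = sat.setdefault(business_key, [])
    if not history or history[-1][1] != attrs:
        history.append((load_dts, attrs))

# Day 1 feed.
load_hub(hub_customer, "12345", "2002-01-01")
load_link(lnk_cust_invoice, ("12345", "INV-9"), "2002-01-01")
load_satellite(sat_customer, "12345", {"name": "Dan"}, "2002-01-01")
# Day 2 feed: same name arrives again -> no new satellite row.
load_satellite(sat_customer, "12345", {"name": "Dan"}, "2002-01-02")
# Day 3 feed: the name changed -> one new delta row.
load_satellite(sat_customer, "12345", {"name": "Jane"}, "2002-01-03")

print(len(sat_customer["12345"]))  # 2 rows: the original and one delta
```

The delta check in `load_satellite` is what keeps duplicate rows out of the satellite, matching the rule above that satellites are always delta driven.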
Solving the Pain of Data Warehouse Architectures

3NF and star schema, when used for enterprise data warehousing, may cause pain to the business because they were not originally built for this purpose. There are issues surrounding scalability, flexibility, granularity of data, integration and volume. The volume of information that warehouses are required to store is increasing exponentially every year. CRM, SCM, ERP and all the other large systems are forcing volumes of information to be fed to the warehouses. The current data models based on 3NF or star schema are proving difficult to modify, maintain and query, let alone back up and restore.

In the example provided previously, if the scope is to warehouse vehicle data and the corresponding attributes over time, that is a single Data Vault comprised of a single hub with a few satellites. A year later, if the business wants to warehouse contracts associated with that vehicle, hubs and links can be added easily, with no worries about granularity. This type of model extends upward and outward (bottom-up implementation, top-down architecture). The end result is always foundationally strong and can be delivered with an iterative development approach.

Another example is the power of the link entity. Suppose a company sells products today and has a product hub, an invoice hub and a link between the two. Then the company decides to sell services. The data model can establish a new services hub, end date the entire set of product links and start a new link between services and invoices. No data is lost, and all data going back over time is preserved – matching the business change. This is only one of many possibilities for handling this situation.

Volume causes query issues, particularly with the structures of star schema but not so much with 3NF. Volume is breaking queries that are after the information in conformed dimensions and conformed fact tables.
Partitioning is often required, and the structures are continually reworked to provide additional granularity to the business users. This creates a management and maintenance nightmare. Reloading an ever-changing star is difficult – let alone attempting to perform this with volume (upwards of 1 terabyte, for instance). The Data Vault is rooted in the fundamentals of mathematics that stand squarely behind the normalized data model. Reduction of redundancy and accounting for rates of change among data sets
contribute to increased performance and easier maintenance. The Data Vault architecture is not limited to fitting on a single platform; it allows for a distributed yet interlinked set of information.

Data warehouses must frequently deal with the statement: "What I (the user) will give you won’t ever come from the source system." The user then proceeds to provide a spreadsheet with their daily maintained interpretation of the information. In other words: "I (the customer) want to see all VINs that start with X rolled up under the label BIG TRUCKS." What the Data Vault provides for this is called a user grouping set. It is another hub (label: Big Trucks) with a satellite describing which VINs roll up under this label and a link to the VINs themselves. In this manner, the original data from the source system is preserved while the query tools can view the information in a manner appropriate to the users’ needs. When all is said and done, a data warehouse is successful if it meets the users’ needs.

The Foundations of the Data Vault Architecture

The architecture is rooted in the mathematics of reduction of redundancy. The satellites are set up to store only the deltas, or changes, to the information within. If a single satellite begins to grow too quickly, it is very easy to create two new satellites and run a delta splitter process – a process that splits the information into the two new satellites, each running another delta pass before inserting the new rows. This keeps the rate of duplication of columnar data to a minimum, which equates to utilizing less storage. Satellites by nature can be very long and, in most cases, are geared to be narrow (not many columns). In comparison, Type 2 dimensions may replicate data across many columns, making copies of information over and over again as well as generating new keys. The hubs store a single instance of the business key.
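The delta splitter idea can be sketched as follows. This is a speculative illustration, not the author’s process: a wide satellite’s history is projected onto two narrower attribute groups, and each new satellite is re-deduplicated so only true deltas remain (all names and data are invented):

```python
# Speculative sketch of a "delta splitter": split a wide satellite's
# history into two narrower satellites by attribute group, then keep
# only rows that are true deltas within each new satellite.

wide_rows = [  # (load_dts, attrs) history for one automobile hub key
    ("d1", {"color": "blue", "tires": "std", "avail_from": "d1", "avail_to": "d2"}),
    ("d2", {"color": "blue", "tires": "std", "avail_from": "d2", "avail_to": "d3"}),
    ("d3", {"color": "red",  "tires": "std", "avail_from": "d3", "avail_to": "d4"}),
]

def split_satellite(rows, group_a, group_b):
    """Project each row onto two attribute groups, dropping non-deltas."""
    sat_a, sat_b = [], []
    for dts, attrs in rows:
        for keys, sat in ((group_a, sat_a), (group_b, sat_b)):
            projected = {k: attrs[k] for k in keys}
            if not sat or sat[-1][1] != projected:  # delta check
                sat.append((dts, projected))
    return sat_a, sat_b

slow, fast = split_satellite(
    wide_rows,
    group_a=("color", "tires"),          # slowly changing attributes
    group_b=("avail_from", "avail_to"),  # rapidly changing attributes
)
print(len(slow), len(fast))  # 2 3 -- color changed once; availability daily
```

The slowly changing attributes now occupy two narrow rows instead of being replicated across all three wide rows, which is the storage reduction the paragraph describes.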
The business keys most often have a very low propensity to change. Because of this, surrogate keys mapped to business keys (if surrogates are utilized) are a one-to-one mapping and never change. The primary key of the hub (regardless of type – business or surrogate) is the only piece of information replicated across the satellites and links. Because of this, the satellites are always tied directly to the business key. In this manner, satellites are relegated to describing the business key at the most granular level available. This provides a basis for developing "context" about a business key.

Another unique result of the Data Vault is the ability to represent relationships dynamically. A relationship is founded in a link structure the first time the business keys are "associated" in incoming source data. This relationship exists until it is either end dated (in a satellite) or deleted from the data set completely. The fact that relationships are represented in this manner opens up new possibilities in the area of dynamic relationship building. If new relationships between two hubs (or their context) are discovered as a result of data mining, new links can be formed automatically. Likewise, link structures and information can be end dated or deleted when they are no longer relevant. For example: a company is selling products today and has a link table between products and invoices. Tomorrow, it begins selling services. It may be as simple as constructing a services hub and a link between invoices and services – then end dating all the relationships between products and invoices. In this example, the process of changing the data model can begin to be explored programmatically. If automated, this would dynamically change and adapt the structure of the data warehouse to meet the needs of the business users.
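The products-to-services change can be sketched as data operations rather than schema surgery. In this hedged illustration, end dating is recorded in an effectivity satellite attached to the link; the names and date values are invented:

```python
# Illustrative sketch: end dating a link via an effectivity satellite
# attached to it, rather than deleting history. Names are invented.

lnk_product_invoice = [("prod-1", "inv-1"), ("prod-2", "inv-2")]
lnk_service_invoice = []      # new link, created when services begin
sat_link_effectivity = {}     # link row -> (start_dts, end_dts or None)

for pair in lnk_product_invoice:
    sat_link_effectivity[pair] = ("2002-01-01", None)  # open-ended

def end_date_all(link, sat, end_dts):
    # Close every open effectivity record; the link rows themselves
    # remain, so all history going back over time is preserved.
    for pair in link:
        start, end = sat[pair]
        if end is None:
            sat[pair] = (start, end_dts)

# The business stops selling products and starts selling services.
end_date_all(lnk_product_invoice, sat_link_effectivity, "2002-12-06")
lnk_service_invoice.append(("svc-1", "inv-3"))

open_links = [p for p, (_, end) in sat_link_effectivity.items() if end is None]
print(len(lnk_product_invoice), len(open_links))  # 2 0 -- history kept, nothing open
```

Because only satellite rows change, a process discovering or retiring relationships (by data mining, for instance) could in principle perform these operations automatically, which is the dynamic behavior described above.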
Rates of change and reduction of redundancy, along with the flexibility of potentially unlimited dynamic relationship alteration, form a powerful foundation. These items open doors to applying the Data Vault structures to many different purposes.

Possible Applications of the Data Vault

As a result of these foundations, many different applications of the Data Vault may be considered. A few of these are already in the throes of development. A small list of the possibilities is below:
• Dynamic Data Warehousing – Based on dynamic, automated changes made to both process and structure within the warehouse.
• Exploration Warehousing – Allowing users to play with the structures of the data warehouse without losing the content.
• In-Database Data Mining – Allowing the data mining tools to make use of the historical data, and to better fit the form (structure) with the function of data mining/artificial intelligence.
• Rapid Linking of External Information – An ability to rapidly link and adapt structures to bring in external information and make sense of it within the data warehouse without destroying existing content.

The business of data warehousing is evolving – it must move in order to survive. The architecture and foundations behind what data warehousing means will continue to change. The Data Vault overcomes most of the problems and limitations of the past and stands ready to meet the challenges of the future.

Data Vault: The Next Evolution in Data Modeling, Part 2
By Dan Linstedt
Published 12/06/02

The purpose of this two-part series is to present and discuss a patent-pending technique called the Data Vault – the next evolution in data modeling for enterprise data warehousing. The intended audience is data modelers who wish to construct a Data Vault data model. This article focuses on a specific example: the Microsoft SQLServer 2000 Northwind database. For the purposes of this discussion, it is suggested that the reader obtain, at a minimum, a trial copy of the SQLServer 2000 database engine. Please read Part 1, which defines the Data Vault architecture; it provides the context for what the data model is and how it fits into business.

Let’s consider this for a moment: suppose it is possible to reverse engineer a data model into a warehouse. What would that mean for a data warehousing project? Suppose it could be done in an automated fashion – would that help or hurt?
What if the only consideration necessary were how to integrate the different aspects of the generated data models? These and many more questions come to mind when beginning to consider the automation of data modeling for data warehousing, particularly when the consideration involves mechanized engineering. For our purposes, having this functionality to produce a baseline Data Vault would be of tremendous help.

The Northwind data model was converted both by hand and in an automated fashion. When the two data models were compared, they had only minor differences. Further examination showed that hand conversion opened up possibilities for errors in the link tables, where the automated converter kept the links clean. Among the most important items in mechanizing the process are naming conventions, abbreviation conventions and the specification of primary/foreign keys.

What's important here is that this is a baby step into the application of "dynamic data warehousing," or dynamic model changes (please see my other article, Bleeding Edge Data Warehousing, due out in the Journal of Data Warehousing Fall 2002). Automation also produced a data model in 10 minutes (for this particular example), when it took roughly two hours to convert by hand. It then took an additional 20 minutes to adjust the model slightly and implement it. Keep in mind, this is a small data model, and all that is proposed is auto-vaulting of one OLTP data model at a time. The automated process isn't yet smart enough to integrate the resulting Data Vault data models.

The DDL is available on the Web at http://www.coreintegration.com. Sign in to our free online community, the Inner Core, click on Downloads, select Data Warehousing, then find the zip file titled DataVault2DDL.zip (for this series). The DDL and the views are built for Microsoft SQLServer 2000. Feel
free to convert them to a database of your choice. The automated mechanism is not available today – it's still in an experimental phase. The DDL contains the tables for a Data Vault and the views to populate the structure, both initially and with changes. Please keep in mind: this is not a "perfect Vault" and has not been conditioned to the same quality of data model as would be delivered to a customer. It is meant as an example only, for trial purposes. Feel free to contact me directly with questions or comments.

Examining an OLTP 3NF Model for Conversion

Some OLTP data models in 3NF are easier to convert than others. There are some distinctive properties which make the conversion process easier. Here are a few items to look for:

1. How well does the data model adhere to standard naming conventions? This will have an effect on the integration of fields. If the fields (attributes) are named the same across the model, then the resulting Data Vault will be easier to build, and it will be easier to identify which components have been translated.

2. How many independent tables are in the data model? Independent tables usually don't integrate into a Data Vault very well; it is a stretch to integrate them through field name matching. Normally these independent tables are copied across into a Data Vault as standalone tables – until integration points can be found.

3. Have primary and foreign key relationships been defined? If referential integrity has been turned off in the data model, it will be exceedingly difficult to create a Data Vault model. It is nearly impossible to convert automatically; however, with some hard work and rolling up of the sleeves (digging into business requirements), it can be done effectively.

4. Does the model utilize surrogate keys instead of natural keys? The converted data model favors natural keys over surrogate keys.
The models that are converted by hand require that the data modeler understand the business well enough to identify the business keys (natural keys) and their mapping to the surrogate keys.

5. Does the model match the business requirements for the data warehouse? If the requirements are to integrate or consolidate data across the OLTP system (such as a single customer view or a single address view), then the process of converting to a Data Vault may be a little more difficult. The process may require cross-mapping data elements for integration purposes.

6. Can the information be separated by class or type of data? In other words, can all the addresses be put in a single table, all the parts in another table, all the employees, and so on? Separating the classes of data helps with the integration effort. This is usually a manual cross-mapping and regrouping of attributes.

7. How quickly do certain attributes change? A Data Vault likes to separate data by rates of change. It is easier to model a Data Vault if the rates of change of the underlying information are known.

These are suggested items to consider before converting the data model; they are by no means a complete list. First and foremost, the data warehouse data model should always follow the business requirements, regardless of how the baseline or initial model is generated. It is suggested that a scorecard approach be developed. Over time, these items can be placed on a scale of difficulty; when that happens, the scorecard will provide a good guideline to the "convertibility" of a particular data model. Future series will cover migrating conformed data marts and other types of adapted 3NF EDW to a Data Vault schema.

The Northwind Database
Northwind is built by Microsoft and is installed with every Microsoft SQLServer 2000 database. It is freely accessible, with sample data. The data model is shown in Figure 1.

In this model, the first thing to notice is the use of non-standard data types: bit, ntext, image and money. These don't port very well to other relational databases. This is important to resolve because most data warehouses are not built on the same database engine as their OLTP counterparts. In this case, the Data Vault will be built on the same RDBMS engine.

Another item that pops out of the data model is the recursive relationship. Immediately, this should signal a necessary change to the data model. The naming conventions appear consistent across the model: ID is used synonymously with primary keys, primary and foreign keys are defined, there are no independent tables, and the model appears to use some surrogate and some natural keys.

For the sake of discussion, the business requirements are to house all of the data in the warehouse and store only incremental changes to the data over time. The attributes could be classed out (normalized) further if desired; items such as address, city, region and postal code could all be grouped. Do certain attributes change faster than others? From looking at the model, the two tables with the most changes might be Orders and Order Details. There really isn't a method that will help the discovery of rapidly changing elements in this model. Normally, rapidly changing elements are either indicated by business users or revealed in audit trails, usage logs or time stamps on the data itself. In this case, none of these are present.

Figure 1: The Northwind Physical 3NF Data Model
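To make the portability point concrete, the mapping below is a hypothetical starting point for translating these four SQLServer-specific types into more portable equivalents. The target types and precisions are illustrative assumptions, not from the article, and the right choices depend on the target RDBMS.

```python
# Hypothetical mapping of SQLServer 2000-specific column types to more
# portable equivalents; the precision choices here are illustrative.
PORTABLE_TYPES = {
    "bit":   "smallint",       # 0/1 flag; some engines offer a native boolean
    "money": "decimal(19,4)",  # money is fixed-point with four decimal places
    "ntext": "varchar(2000)",  # the article's views compare only 2,000 characters
    "image": "blob",           # raw binary; varbinary/bytea on other engines
}

def port_column(name: str, source_type: str) -> str:
    """Rewrite a column declaration with a portable type where needed."""
    return f"{name} {PORTABLE_TYPES.get(source_type.lower(), source_type)}"

print(port_column("Picture", "image"))     # Picture blob
print(port_column("UnitPrice", "money"))   # UnitPrice decimal(19,4)
print(port_column("OrderID", "int"))       # OrderID int (already portable)
```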
The Process of Modeling a Data Vault

In order to keep the design simple yet elegant, there are a minimum number of components – specifically the hub, link and satellite – plus traditional data modeling skill sets. These were defined in Part 1; please refer to the first article for the definitions and table structure setup. This section discusses the process of converting the above data model to an effective Data Vault. The steps for a single model conversion without integration are as follows:

1. Identify the business keys and the surrogate key groupings; model the hubs.
2. Identify the relationships between the tables that must be supported; model the links.
3. Identify the descriptive information; model the satellites.
4. Regroup (normalize) the satellites by rates of change or types of information.

To address more than one model, start with the business-identified "master system." Build the first data model, then incrementally map other data models and data elements into the single unified view of information.

There are three styles of load dates in the EDW Data Vault architecture, and before modeling can begin it is wise to choose the style that suits your needs. The styles are as follows:

1. Standard load date, as indicated in Parts 1 and 2 of this article. This is easy to load but difficult to query. For more than two satellite tables off a hub, it may require an additional "picture table," or point-in-time satellite, to house the delta changes for equi-joins.

2. Load date data type altered to be an integer reference to a load table where the date is stored. The integer reference is a standalone foreign key to the load table and can be used if date logic is not desired. Be aware that this can cause difficulties in reloading and resequencing the keys in the warehouse; it is not a recommended practice/style.

3. Load end date added to all the satellites. Rows in satellites are end dated as new rows are inserted.
This can help the query perspective, while at the same time making loading slightly more complex. Using this style, it may not be necessary to construct a picture table (point-in-time satellite).

Select the style that best suits the business needs and implement it across the model. Part of Data Vault modeling success is consistency: stay consistent with the style that's chosen, and the model will be solid from a maintenance perspective.

Hub Entities

Since the hubs are a list of business keys, it is important to keep the business keys together with any surrogate keys (if surrogates are available). Upon examination of the model, we find the following business key/surrogate key groupings (the examination included unique indexes and a data query):

• Categories: CategoryName is the business key; CategoryID is the surrogate key. This will constitute a HUB_Category table.
• Products: ProductName is the business key; ProductID is the surrogate key. This will constitute a HUB_Product table.
• Suppliers: SupplierName is the business key; SupplierID is the surrogate key. This will constitute a HUB_Supplier table.
• Order Details: has no business key and cannot "stand on its own." Therefore, it is NOT a hub table.
• Orders: appears to have a surrogate key – which may or may not constitute a business key (it depends on the business requirements). Upon further investigation we find many foreign keys. The table appears to be transactional in nature, which makes it a good candidate for a link rather than a hub table.
• Shippers: CompanyName is the business key, and ShipperID is the surrogate key. Shippers will constitute a HUB_Shippers table. If the business requirements state that an integration of "companies" is required, then the CompanyName field in Shippers can be utilized. However, if the business requirements state that shippers must be kept separate, then CompanyName is not descriptive enough and should be changed to ShipperName, in keeping with the current field naming conventions.
• Customers: CompanyName is the business key, and CustomerID is the surrogate key. Customers will constitute a HUB_Customers table. Again, if integration is desired, then an entity called HUB_Company might be constructed (to integrate Customers and Shippers).
• CustomerCustomerDemo: has no real business key and cannot stand on its own; therefore it will be a link table.
• CustomerDemographics: at first glance, CustomerDesc appears to be the business key, with CustomerTypeID being the surrogate key; however, this could also be constructed as a satellite of Customer. Remember that the warehouse is meant to capture the source system data, not enforce the rules of capture. For the purposes of this discussion, HUB_CustomerDemographics will be constructed.
• Employees: EmployeeName appears to be the best business key, with EmployeeID being the surrogate key. This will constitute a HUB_Employee table.
• EmployeeTerritories: there appears to be no real business key here; this will not constitute a hub table and will most likely become a link table.
• Territories: TerritoryDescription appears to be the business key, with TerritoryID being the surrogate key. This will constitute a HUB_Territories table.
• Region: RegionDescription is clearly the business key; RegionID is the surrogate key. This will constitute a HUB_Region table.
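Putting the analysis above into a table definition, a hub can be sketched as follows. This is a minimal sketch using SQLite for illustration (the article's DDL targets SQLServer 2000), and the exact column layout – surrogate key, business key, load date, record source – is an assumption based on the Part 1 definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A hub is the list of unique business keys, kept together with the
# source's surrogate key (when one exists) plus the load-audit columns.
conn.execute("""
    CREATE TABLE HUB_Category (
        CategoryID   INTEGER PRIMARY KEY,   -- surrogate key from the source
        CategoryName TEXT NOT NULL UNIQUE,  -- the business key
        LoadDate     TEXT NOT NULL,         -- when the warehouse first saw the key
        RecordSource TEXT NOT NULL          -- which system supplied it
    )
""")

# Each business key appears exactly once; the UNIQUE constraint enforces it.
conn.execute(
    "INSERT INTO HUB_Category VALUES (1, 'Beverages', '2002-12-06', 'Northwind')"
)
print(conn.execute("SELECT CategoryName FROM HUB_Category").fetchone()[0])
```

The remaining hubs differ only in their key names, which is part of what makes the structure amenable to automated generation.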
Once the analysis has been done for each of the table structures, we can assemble the list of hub tables to be built: Hub_Category, Hub_Product, Hub_Supplier, Hub_Shippers, Hub_Customer, Hub_CustomerDemographics, Hub_Employee, Hub_Territories and Hub_Region. There are a couple of questionable items which, depending on the business rules, may have their structures integrated. Remember that the hub structures are all very similar; an example of Hub_Category is given in Figure 2.

Figure 2: Example of Hub_Category

Now that we have the hub structures, we can move on to the links. The function of the hubs is to integrate and centralize the business around the business keys.

Link Entities
The links represent the business processes – the glue that ties the business keys together. They describe the interactions and relationships between the keys. It is important to realize that the business keys, and the relationships they participate in, are the most important elements in the warehouse; without this information, the data is difficult to relate. Typically, transactions and many-to-many tables make good link tables. Along with that, any table that doesn't have a respective business key becomes a good link entity. Tables with a single-attribute primary key often make good hub tables; however, the requirement is still for a business key. In the case of Orders, a business key does not exist. The link tables of our model are as follows:

• Order Details: many-to-many table; an excellent link table. LNK_OrderDetails will be constituted.
• Orders: many-to-many, parent transaction of Order Details; an excellent link table. LNK_Orders will be constituted. However, please note: it may or may not be appropriate to constitute Hub_Orders as a hub table, depending on the business and its desire to track Order ID. In this case, we will keep it as a link table.
• CustomerCustomerDemo: many-to-many table; an excellent link table. LNK_CustomerCustomerDemo will be constituted.
• EmployeeTerritories: many-to-many table; an excellent link table. LNK_EmployeeTerritories will be constituted.

Did we get all the linkages? No. Look again. There are some parent/child foreign key relationships in tables that are slated to become hubs, and hubs don't carry parent/child relationships or resolve granularity issues. Examining the Products table, we see both a CategoryID and a SupplierID. This will constitute a LNK_Product table, including the ProductID, SupplierID and CategoryID.
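A LNK_Product along these lines can be sketched as below – again a SQLite illustration with assumed column names, in the spirit of Figure 3. A link carries only the keys it ties together plus the load-audit columns; the descriptive attributes stay out, destined for satellites. The sample rows use Northwind's own values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HUB_Product  (ProductID  INTEGER PRIMARY KEY, ProductName  TEXT UNIQUE);
    CREATE TABLE HUB_Supplier (SupplierID INTEGER PRIMARY KEY, SupplierName TEXT UNIQUE);
    CREATE TABLE HUB_Category (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT UNIQUE);

    -- The link resolves the parent/child foreign keys found in Products.
    -- ProductID itself keys the link (no extra surrogate), since this
    -- model allows one supplier and one category per product.
    CREATE TABLE LNK_Product (
        ProductID    INTEGER PRIMARY KEY REFERENCES HUB_Product,
        SupplierID   INTEGER NOT NULL REFERENCES HUB_Supplier,
        CategoryID   INTEGER NOT NULL REFERENCES HUB_Category,
        LoadDate     TEXT NOT NULL,
        RecordSource TEXT NOT NULL
    );

    INSERT INTO HUB_Product  VALUES (1, 'Chai');
    INSERT INTO HUB_Supplier VALUES (1, 'Exotic Liquids');
    INSERT INTO HUB_Category VALUES (1, 'Beverages');
    INSERT INTO LNK_Product  VALUES (1, 1, 1, '2002-12-06', 'Northwind');
""")
```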
In a true data warehouse we would construct a surrogate key for this link table; however, in this case the data model states that ProductID is sufficient to represent the supplier and category (as indicated by Order Details), so no surrogate key is necessary. In cases of integration (across other sources), it may be necessary to put a surrogate key into multiple link tables.

Are there other parent/child relationships that need a linkage? Yes: Employees has a recursive relationship. To draw this out, we will construct a LNK_Employee table, so that the "ReportsTo" relationship can be handled through a link table. There are no more relationships that need to be resolved, so we can move on to satellite entities. An example of a link table is shown in Figure 3.

Figure 3: Link Table

Satellite Entities

The rest of the fields are subject to change over time – therefore, they will be placed into satellites. The following tables will be created as satellite structures: Categories, Products, Suppliers, Order Details, Orders, Customers, Shippers and Employees. The satellites contain only non-foreign key attributes. The primary key
of the satellite is the primary key of the hub with a LOAD_DATE incorporated; it is a composite key, as described in Part 1. In the interest of time and space, only one example of a satellite is listed in Figure 4.

Figure 4: Satellite Example

The physical data model now appears as shown in Figure 5.

Figure 5: Physical Northwind Data Vault Model
If this is difficult to read, the full image is available in a PDF (in the ZIP file in the Inner Core Downloads) at http://www.coreintegration.com (sign up for the Inner Core – it's free – then go to the Downloads section). This is the entire data model, with the hubs in light gray/blue, the links in red and the satellites in white. It uses style 1, with just a standard load date. In the interest of space, the other styles will be presented in a future article.

Populating a Data Vault

If the Auto Vault generation process is used, the views to populate the data structures are generated right along with the structures themselves. In this case, the views have been generated; a sample is provided of one each of the hub, link and satellite views.

The hubs are inserts only. They record the business keys the first time the data warehouse sees them; they do not record subsequent occurrences. Only new keys are inserted into the hubs. The links are the same way: inserts only, for only the rows that do not already exist in the links. The satellites are also delta driven; they insert any row that has changed from the source system perspective, providing an audit trail of changes.

Another purpose of a Data Vault structure is to house 100 percent of the incoming data, 100 percent of the time. It is up to the reporting environments and the data marts to determine what data is "in error" according to business rules. A Data Vault makes it easy to construct repeatable, consistent processes, including load processes. The architecture provides another baby step in the direction of allowing dynamic structure changes.

To load the hubs: select a distinct list of business keys with their surrogate keys, where the keys do not already exist in the hub. See Figure 6.
Figure 6: Load Hubs

To load the links: select a distinct list of composite keys with their surrogates (if provided), where the data does not already exist in the link (see Figure 7).

Figure 7: Load Links
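The insert-only pattern just described for hubs and links, together with the delta-driven satellite load covered next, can be sketched end to end. This is a SQLite illustration of the logic only – not the SQLServer views from the download – and the table and column names are assumptions. Links follow the same NOT EXISTS shape as the hub shown here, keyed on their composite of keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_Categories (CategoryID INT, CategoryName TEXT, Description TEXT);
    CREATE TABLE HUB_Category   (CategoryID INT PRIMARY KEY, CategoryName TEXT UNIQUE,
                                 LoadDate TEXT);
    CREATE TABLE SAT_Category   (CategoryID INT, LoadDate TEXT, Description TEXT,
                                 PRIMARY KEY (CategoryID, LoadDate));
    INSERT INTO src_Categories VALUES (1, 'Beverages', 'Soft drinks');
""")

def load(load_date):
    # Hubs are inserts only: record a business key the first time it is
    # seen, and skip keys that already exist (links use the same shape).
    conn.execute("""
        INSERT INTO HUB_Category (CategoryID, CategoryName, LoadDate)
        SELECT DISTINCT s.CategoryID, s.CategoryName, ?
        FROM src_Categories s
        WHERE NOT EXISTS (SELECT 1 FROM HUB_Category h
                          WHERE h.CategoryName = s.CategoryName)""", (load_date,))
    # Satellites are delta driven: insert a row only when the source
    # differs from the *latest* satellite row for that key.
    conn.execute("""
        INSERT INTO SAT_Category (CategoryID, LoadDate, Description)
        SELECT s.CategoryID, ?, s.Description
        FROM src_Categories s
        WHERE s.Description IS NOT (
              SELECT t.Description FROM SAT_Category t
              WHERE t.CategoryID = s.CategoryID
              ORDER BY t.LoadDate DESC LIMIT 1)""", (load_date,))
    conn.commit()

load("2002-12-01")   # first load: one hub row, one satellite row
load("2002-12-02")   # nothing changed: no new rows anywhere
conn.execute("UPDATE src_Categories SET Description = 'Soft drinks and ales'")
load("2002-12-03")   # description changed: one new satellite row only
```

Running the same load repeatedly against unchanged data inserts nothing, which is what makes the process repeatable and consistent.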
To load the satellites: select a set of records, matched or joined to the business key (or the composite key if possible), where the columns between the source and target have at least one change. Match only to the "latest" picture of the satellite row in the satellite table for comparison purposes (see Figure 8).

Figure 8: Load Satellites

The view is built to handle null comparisons, and it chops the comparison on text and image components to only 2,000 characters. The comparison is extremely fast and is a short-circuit Boolean evaluation. These views are run as Insert Into… Select * From… statements. The satellite view is fast as long as partitioning is observed along with the primary key.

Views work well when the source and the Data Vault are in the same instance of the relational database engine. If different instances are utilized, then there are two suggested solutions: 1) stage the source data in the warehouse target so that the views can be used; or 2) utilize an ETL tool to make the transfer and comparison of the information. Staging the information in the warehouse and utilizing the views allows the database engine to keep the data local and, in some cases, take advantage of highly parallelized operations in the RDBMS engine (such as Teradata, for instance).

Summary

This article provides a look at implementing and building a Data Vault, along with a sample Data Vault structure to which most everyone has access. This simple example is meant to show that a Data Vault can be built in an iterative fashion, and that it is not necessary to build the entire EDW in one sitting. It also serves as an example for Part 1, showing that this modeling technique is effective and efficient. The next series will dive into querying this style of data model and will discuss style 3 – the load end date of records vs. the point-in-time satellite structure.

Dan Linstedt is chief technology officer for Core Integration Partners.
You can view his online presentation about the Data Vault at the online trade show on www.dataWarehouse.com/tradeshow/ until January 15, 2003. If you are interested in more information, please contact Linstedt at dlinstedt@coreintegration.com or
at Core Integration Partners, 455 Sherman St., Suite 207, Denver, CO USA, 80203. You may also check out the Web site at http://www.coreintegration.com.

Copyright © 2002 DM Review, a division of Thomson Financial Media. All rights reserved.