Data Vault: Data Warehouse Design Goes Agile


Data Warehouse (especially EDW) design needs to get Agile. This whitepaper introduces Data Vault to newcomers, and describes how it adds agility to DW best practices.

DecisionLab.Net | business intelligence is business performance

DecisionLab
http://www.decisionlab.net
dupton@decisionlab.net
Direct: 760.525.3268
http://blog.decisionlab.net
Carlsbad, California, USA

Data Vault: Data Warehouse Design Goes Agile
Whitepaper

Data Vault: Data Warehouse Design Goes Agile

by Daniel Upton
Data Warehouse Modeler and Architect, Certified Scrum Master
DecisionLab.Net | business intelligence is business performance
dupton@decisionlab.net
http://www.linkedin.com/in/DanielUpton

Without my (the writer's) explicit written permission in advance, the only permissible reproduction or copying of this written material is in the form of a review or a brief reference to a specific concept herein, either of which must clearly specify this writing's title, author (me), and this web address: http://www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault . For permission to reproduce or copy any of this material beyond what is specified above, just email me at the above address.
Open Question: When we begin considering a new Data Warehouse initiative, how clear is the scope, really?

If we intend to design Data Marts, and we have no specified need for a data warehouse either to become a system of record or to support Master Data Management (MDM), then we may choose Dr. Ralph Kimball's Data Warehouse Bus architecture, designing a library of conformed (standardized, re-usable) dimension and fact tables for deployment into a series of purpose-built data marts. Under these requirements, we may have no specific need for an Inmon-style third-normal-form (3NF) Enterprise Data Warehouse (EDW) in general, or for a Data Vault in particular.

In other cases, however, because data warehouse data sometimes outlives its corresponding source data inside a soon-to-retire application database, a data warehouse may, like it or not, assume a system-of-record role for its data, as Bill Inmon reminds us. Whereas the Kimball Bus architecture's tables are often not related via key fields, and in fact may not be populated at all until deployment from the Bus into a specific-needs Data Mart, Kimball adherents rarely assert a system-of-record role for their solutions.

But suppose we do determine that our required solution either needs to assume a system-of-record role, or perhaps must support Master Data Management. As such, we may elect to design a fully functional EDW, rather than Kimball's DW Bus, so that the EDW itself, and not just its dependent data marts, is a working, populated database. Now, knowing that the creation of a classic EDW, with its requirement for an up-front, enterprise-wide design, is a challenge with today's expectations for rapid delivery, some may be curious about new design methodologies that offer ways to accelerate EDW design.
Data Vault, a data warehouse modeling method with a substantial following in Denmark and a growing base in the U.S., offers specific and important benefits. To set expectations early about Data Vault, readers must understand that, somewhat unlike a traditional EDW, and utterly unlike a star schema, a Data Vault (not to be confused with Business Data Vault, which is not addressed in this article) cannot serve as an efficient presentation layer appropriate for direct queries. Rather, it is more like a historic enterprise data staging repository that, with additional downstream ETL, will support not only star schemas, reporting, and data mining, but also master data management, data quality, and other enterprise data initiatives.
Data Vault Benefits:

- Benefit #1: Allows for loading of a history-tracking DW with little or none of the typical extraction, transformation, and loading (ETL) transformations that, once they are finally figured out, would otherwise contain subjective interpretations of the data, and which purportedly enhance the data and prepare it for reporting or analytics.
  - In my view, this is almost enough of a benefit all by itself. As such, in the introduction that follows, I will focus on proving this point.
  - Agile Win: Confidently loading a DW without having to already know the fine details of business rules and requirements, and the resulting transformation requirements, means that loading of historical and incremental data can be accomplished before the first target database design (3NF EDW or Data Mart) is complete.
- Benefit #2: Insofar as Data Vault prescribes a very generic downstream 'de-constructing' of OLTP tables, these de-constructing transformations can be automated, and so can the associated early-stage ETL into the Data Vault. Since, as you'll soon see, Data Vault causes a substantial increase in the number of tables, this automation potential is a substantial benefit.
  - Agile Win: Automated initial design and loading, anyone?
- Benefit #3: Due to Data Vault's generic design logic, its use of surrogate keys (more on this soon), and its prescription to avoid subjective-interpretive transformations, it is reasonable to quickly load a Data Vault with just the needed subset of tables.
  - Agile Win: More frequent releases. Quickly design for, and load, only the data needed for the next release. Use the same generic design to load other tables when those User Stories from the Product Backlog get placed into a Sprint.
In the remainder of this article, I will provide a high-level introduction to Data Vault, with primary emphasis on how it achieves Benefit #1.
High-Level Introduction to Data Vault Methodology:

We begin with a simple OLTP database design for clients purchasing products from a company's stores. For simplicity, I include only a minimum of fields. In the diagrams, 'BK' means business key, and 'FK' means foreign key. Refer to Diagram A below. As is common, this simple OLTP schema does not use surrogate keys. If a client gets a new email address, or a product gets a new name, or a city's re-mapping of boundary lines suddenly places an existing store in a new city, new values would overwrite the old values, which would then be lost.

Of course, in order to preserve history, history-tracking surrogate keys are commonly used by practitioners of both Bill Inmon's classic third-normal-form (3NF) EDW design and Dr. Ralph Kimball's star schema method, but both of these methods prescribe surrogate keys within the context of data transformations that also include subjective interpretation (herein simply 'subjective transformation') in order to cleanse or purportedly enhance the data for the purposes of integration, reporting, or analytics. Data Vault purists claim that any such subjective transformation of line-of-business data introduces inappropriate distortion to it, thereby disqualifying the Data Warehouse as system of record.

Data Vault, importantly, provides a unique way to track historical changes in source data while eliminating most, or all, subjective transformations such as field renaming, selective data-quality filters, establishment of hierarchies, calculated fields, and target values. Although analytics-driven, subjective transformations can still be applied, they are applied downstream of the Data Vault EDW, as subsequent transformations for loads into data marts designed to analyze specific processes.
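To make the overwrite-versus-history distinction concrete, here is a minimal sketch in Python (my own illustration; the field names and values are hypothetical, not taken from the whitepaper's diagrams):

```python
from datetime import datetime, timezone

# OLTP-style update: the old email is simply overwritten and lost.
client_oltp = {"client_id": "C-1001", "email": "old@example.com"}
client_oltp["email"] = "new@example.com"  # history is gone

# History-tracking style: every change appends a new dated row,
# so the old value survives alongside the new one.
client_history = [
    {"client_id": "C-1001", "email": "old@example.com",
     "load_date": datetime(2014, 1, 1, tzinfo=timezone.utc)},
]

def record_change(history, client_id, email):
    """Append a new row instead of overwriting the old one."""
    history.append({"client_id": client_id, "email": email,
                    "load_date": datetime.now(timezone.utc)})

record_change(client_history, "C-1001", "new@example.com")

assert len(client_history) == 2          # both values retained
assert client_history[0]["email"] == "old@example.com"
```

The first style is what the simple OLTP schema above does; the second is the behavior the history-tracking surrogate-key approaches are after.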
Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will now describe. Before beginning, I recommend against too-quickly comparing this method to others, like star-schema design, which serve different needs.
Diagram A: Simple OLTP schema (data source for a Data Vault)
Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. The diagram's Client table is a good example. Hubs and Satellites work according to the following simplified descriptions:

Hub Tables:
- Define the granularity of an entity (e.g. Product), and thus the granularity of non-key attributes (e.g. product description) within the entity.
- Contain a new surrogate primary key (PK), as well as the source table's business key, which is demoted from its PK role.

Satellite Tables:
- Contain all non-key fields (attributes), plus a set of date-stamp fields.
- Contain, as a foreign key (FK), the Hub's PK, plus the load date-time stamps.
- Have a defining, dependent entity relationship to one, and only one, parent table.
- Whether that parent table is a Hub or a Link, the Satellite holds the non-key fields from the parent table.
- Although on initial loads only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes (e.g. a client's email address changes) upstream in the OLTP schema (often accomplished up there with a simple overwrite), a new row is added only to the Satellite, and not to the Hub, which is why many Satellite rows relate to one Hub row.

So, in this fashion, historic changes within source tables are gracefully tracked in the EDW. Notice in Diagram B that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the tables, the new location of existing fields, and the various added date-time stamps.
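The Hub/Satellite split described above can be sketched in a few lines of Python. This is a simplified illustration of the pattern, not production ETL; the surrogate-key counter, the in-memory dict and list standing in for tables, and names like client_sk are my own assumptions:

```python
from datetime import datetime, timezone
from itertools import count

_next_sk = count(1)  # stand-in for a database sequence

def load_client(hub, satellite, source_row):
    """Split one OLTP Client row into Client_h and Client_h_s rows.

    The business key lands in the Hub next to a new surrogate PK;
    all non-key attributes land in the Satellite with a load date.
    Attribute changes add Satellite rows; the Hub row is never touched.
    """
    bk = source_row["client_no"]          # business key from the source
    if bk not in hub:                     # Hub: one row per business key
        hub[bk] = {"client_sk": next(_next_sk), "client_no": bk}
    satellite.append({
        "client_sk": hub[bk]["client_sk"],   # FK to the Hub's PK
        "email": source_row["email"],        # non-key attribute
        "load_date": datetime.now(timezone.utc),
    })

client_h, client_h_s = {}, []
load_client(client_h, client_h_s, {"client_no": "C-1001", "email": "a@x.com"})
load_client(client_h, client_h_s, {"client_no": "C-1001", "email": "b@x.com"})

assert len(client_h) == 1      # one Hub row per business key
assert len(client_h_s) == 2    # one Satellite row per change
```

Note how the second load of the same client touches only the Satellite, which is exactly the many-Satellite-rows-to-one-Hub-row behavior described above.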
Diagram B: Hubs and Satellites in a partially-designed Data Vault schema
Link Tables:
- Refer to Diagram C.
- Relate exactly two Hub tables together.
- Contain, now as non-key values, the primary keys of the two Hubs, plus a surrogate PK of the Link's own.
- As with an ordinary association table, a Link is a child to two other tables and, as such, is able to gracefully handle relative changes in cardinality between the two tables and, where necessary, can directly resolve many-to-many relationships that might otherwise cause a show-stopper error in the data-loading process.
- Unlike an ordinary association table, the Link table, with its own surrogate PK, is able to track historic changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that conformed with the initial cardinality between tables would share the same Link table surrogate key, but if an unexpected future source-data change causes a cardinality reversal (so that the one becomes the many, and vice versa), a new row, with a new surrogate key, is generated to capture the new relationship, while the original surrogate key preserves the historical one. Slick!
- In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables, too. As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that, in coordination with its surrogate PK, an Order (perhaps for a long-running service) that, after the order date, gets re-credited to a different store can be efficiently tracked over time.
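The re-credited-order example can be sketched as follows (my own minimal illustration; the hub surrogate-key values and the append-only list standing in for the Order_Store_l table are assumptions):

```python
from itertools import count

_link_sk = count(1)  # stand-in for a database sequence

order_store_l = []   # Link rows: own surrogate PK plus the two Hub PKs

def relate(order_sk, store_sk):
    """Add a Link row with its own surrogate PK; never update in place."""
    row = {"link_sk": next(_link_sk),
           "order_sk": order_sk, "store_sk": store_sk}
    order_store_l.append(row)
    return row

relate(order_sk=7, store_sk=1)   # order 7 first credited to store 1
relate(order_sk=7, store_sk=2)   # later re-credited to store 2

# Both relationships survive: the earlier row preserves the history,
# the later row records the current state, each under its own key.
assert [r["store_sk"] for r in order_store_l if r["order_sk"] == 7] == [1, 2]
```

Because every relationship change becomes a new row with a new surrogate key, the Link table never loses a historical association, which is the behavior the bullet list above describes.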
Diagram C: Completed Data Vault schema (Link tables added)
Now we've added Link tables. After scanning Diagram C, go back and compare it with Diagram A, and note the movement of the various non-key attributes. Undoubtedly, you will also notice, and may be concerned, that the source schema's five tables just morphed into the Data Vault's twelve.

Importantly, note that Diagram A's Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table. When you consider that an order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty of vital associated data), then it makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to relate the Orders_h and Products_h tables, of course also needs a Details_l_s Satellite table to hold the vital non-key attributes, Quantity and Unit Price.

The Data Vault method does allow for some interpretation here. You might now be thinking, "Aha! So, we haven't eliminated all subjective interpretation!" Perhaps not, but what I describe here is a pretty small, generic interpretation. Either way, in this situation, it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite), rather than the Details_l Link. Added to that, if we used very simple Data Vault design-automation logic, which simply de-constructs all tables into Hub and Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then have to create not one, but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link table, to connect our tables, and these tables would contain no attributes of apparent value.
Therefore, we choose the design that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, but that's beyond the scope of this article.
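One plausible rule of thumb behind such automation can be sketched as follows. To be clear, this is my own sketch of the idea, not the author's algorithm or any Data Vault standard: a source table that is essentially an association of exactly two other tables becomes a Link, and everything else becomes a Hub (each then getting a Satellite for its non-key attributes).

```python
def classify(table):
    """Crudely classify a source table for Data Vault deconstruction.

    `table` is a dict with 'keys' (the fields forming its primary or
    business key) and 'fks' (its foreign-key fields). A table whose
    key is wholly made of exactly two foreign keys is treated as an
    association, so it becomes a Link; anything else becomes a Hub.
    """
    if len(table["fks"]) == 2 and set(table["keys"]) <= set(table["fks"]):
        return "Link"
    return "Hub"

# The Details table keys on (order, product) pairs: an association.
details = {"keys": ["order_id", "product_id"],
           "fks": ["order_id", "product_id"]}
# The Clients table has its own business key: a proper entity.
clients = {"keys": ["client_no"], "fks": []}

assert classify(details) == "Link"
assert classify(clients) == "Hub"
```

Under this rule, Diagram A's Details table is classified as a Link, matching the design choice made above, while Clients, Products, Orders, and Stores each become Hub-and-Satellite pairs.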
Conclusion:

Our discussion of Data Vault opened with the idea that an EDW should load and store historical data without applying any transformations that contain subjective interpretation of data or business rules, because those interpretations, even if appropriate for specific reporting or analytics, do modify line-of-business data, and therefore introduce distortions into operational data. Those interpretive transformations should occur downstream, during ETL into presentation-layer tables.

Although Data Vault does, in fact, apply a specific set of generic 'de-construction' transformations, these transformations contain little or no subjective interpretation of business rules. They do, however, allow it to (1) apply an appropriate level of referential integrity to source data even where the source system may lack it now or in the future; (2) gracefully capture historical data changes, within and between tables, without endangering the success of the data load; and (3) support loading of data from a subset of source tables initially, and then load, or not load, other related source tables much later without compromising the EDW's referential integrity. Lastly, and very importantly, (4) Data Vault design and the associated loading ETL, which is largely generic from one data set to another, can be automated, and thus radically accelerated in development. Although the logic of this automation flows from the simplicity of Data Vault design, a detailed automation discussion is beyond the scope of this article.

In closing, if we can automatically design and load a Data Warehouse (albeit not its presentation layer), it frees up brain cells for the higher-order logic of designing the presentation layer and the intensive, custom ETL to load it.
As I described here, all of this can be accomplished simultaneously.

daniel upton
dupton@decisionlab.net
DecisionLab.Net | business intelligence is business performance
DecisionLab.Net Range of Services:

Business Intelligence Roadmapping, Feasibility Analysis
BI Project Estimation and Requirement Modelstorming
BI Staff Augmentation: Data Warehouse / Mart / Dashboard Design and Development

Daniel Upton
DecisionLab
http://www.decisionlab.net
dupton@decisionlab.net
Direct: 760.525.3268
http://blog.decisionlab.net
Carlsbad, California, USA