
Data Vault: Data Warehouse Design Goes Agile


Data Warehouse (especially EDW) design needs to get Agile. This whitepaper introduces Data Vault to newcomers, and describes how it adds agility to DW best practices.

  • Yes, I've used Data Vault with SQL Server 2012.
    I do not advocate the use of Data Vault as a presentation layer for analytic or reporting queries, but rather as a permanent data staging repository for a Data Warehouse, Master Data Management, Data Quality and/or Business Intelligence solution. Because of the substantial increase in tables, and thus table joins, I expect that query performance would be worse not only than that of a star schema, but even than that of the original OLTP schema from which the Data Vault was sourced.
    If you wish to build an SSAS cube sourced from a Data Vault, I recommend doing one of two things...
    (1) Perform additional ETL from the Data Vault and deliver the data into a Star Schema presentation layer, or...
    (2) If your data size is small and/or your acceptable SSAS cube processing time window is sufficiently lengthy, you might get away with simply building star-schema-like views on the Data Vault, and building a cube from those views. I don't love this approach, but it may suffice in some cases (a sketch follows below).
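    To illustrate option (2), here is a minimal sketch of one such view, assuming hypothetical Hub/Satellite names like those in the whitepaper's diagrams, and a Load_End_Date column where NULL marks the current row (neither is prescribed by Data Vault itself):

        -- Hypothetical dimension-like view over a Hub/Satellite pair (T-SQL).
        -- All table and column names here are illustrative assumptions.
        CREATE VIEW dbo.DimClient AS
        SELECT  h.Client_SK,
                h.Client_BK,
                s.FirstName,
                s.LastName,
                s.Email
        FROM    dbo.Client_h   AS h
        JOIN    dbo.Client_h_s AS s
                ON  s.Client_SK = h.Client_SK
                AND s.Load_End_Date IS NULL;  -- keep only the current Satellite row

    SSAS can then build its dimensions from such views, paying the extra join cost at cube-processing time rather than at query time.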
  • Recently, a reader emailed me with the following comments (which I respond to in the post immediately following this one)...

    'I enjoyed reading your deck on slideshare.net re Data Vault. I was wondering if you have tinkered with the DV architecture using SQL Server 2012, and had users query off it thru a view from SSRS? Were there any performance gains over a Kimball star schema?

    Also, I was thinking about using SSAS to build some cubes off a DV physical schema. Have you done anything like that before? It's hard to find anything much about DV support from the MSFT SQL Server team.'

    My response follows in the next post...
  • Good questions. Here's what comes to mind. By separating business keys (into Hubs) from non-key fields (into Satellites), we can:
    (1) Historically track changing values in source systems where the changes are simply over-written. Example: Customer BizKey = 123. Customer FirstName = 'Megan'. Customer LastName (at first) = 'Jones'. Megan marries Joe Smith, so Megan's LastName value is over-written as 'Smith'. Although the source system loses 'Jones', the Data Vault's 'Customer_Satellite' table now has two rows in it as children of the 'Customer_Hub' record where Customer_BizKey = '123' (as sketched in the query below). So, we now have almost 'SCD Type 2' history tracking without the introduction of what I call 'subjective transformations', which are driven by reporting and analytics requirements, but which may corrupt the data for MDM usage.
    (2) As we load related data from other source systems, the separation of Hub and Satellite data may result in more efficient joins of Hubs to each other (via Link tables) in the Vault, and, to answer your other question, this absolutely includes joins to Hubs integrated from other data sources, without having to store non-key fields on disk with business keys.
    Lastly, although you did not ask this question, I should clarify something. When you do, in fact, integrate data from other data sources into a Data Vault, that data modeling process is not, in my mind, automatable, and requires careful data profiling. Of course, this data modeling is much simpler than in an Inmon (or Kimball) modeling process, since the cross-database joins simply need key-to-key matches at this stage. The higher-order logic for designing and ETL-loading the downstream presentation-layer schema has not gone away, but the initial loading tasks need not be delayed until its completion either.
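    To make item (1) concrete, here is a small illustrative query; the table and column names simply follow the example's spirit and are not a fixed schema:

        -- Illustrative only: after the source over-writes Megan's last name,
        -- the Satellite holds both versions as children of one Hub row.
        SELECT  h.Customer_BizKey,
                s.FirstName,
                s.LastName,
                s.Load_Date
        FROM    Customer_Hub       AS h
        JOIN    Customer_Satellite AS s
                ON s.Customer_SK = h.Customer_SK
        WHERE   h.Customer_BizKey = '123'
        ORDER BY s.Load_Date;
        -- Expected rows: ('123', 'Megan', 'Jones', <initial load date>)
        --                ('123', 'Megan', 'Smith', <load date after the change>)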
  • Daniel, I enjoyed the reading and appreciate your creativity very much. It would be fun to see this solution implemented and productized. With that said, these two questions may help me understand your design better: 1. What are the benefits of separating the Hub and Satellite tables? Why not have just the Client, Product, etc. tables with all the surrogate and business keys (the BK would not be unique over time, of course)? 2. Is there an assumption that there is only one source OLTP system per data domain (Client, Product, Order, etc.)? What if orders or products arrive from multiple source apps (including external apps that you do not manage)? How do you vault those? Thank you! Michael Romm

Data Vault: Data Warehouse Design Goes Agile

  1. DecisionLab.Net | business intelligence is business performance
     DecisionLab | http://www.decisionlab.net | dupton@decisionlab.net | direct 760.525.3268 | http://blog.decisionlab.net | Carlsbad, California, USA
     Data Vault: Data Warehouse Design Goes Agile
  2. Whitepaper: Data Vault: Data Warehouse Design Goes Agile
     by daniel upton, data warehouse modeler and architect, certified scrum master
     DecisionLab.Net | business intelligence is business performance
     dupton@decisionlab.net | http://www.linkedin.com/in/DanielUpton
     Without my (the writer's) explicit written permission in advance, the only permissible reproduction or copying of this written material is in the form of a review or a brief reference to a specific concept herein, either of which must clearly specify this writing's title, author (me), and this web address: http://www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault . For permission to reproduce or copy any of this material other than what is specified above, just email me at the above address.
  3. Open Question: When we begin considering a new Data Warehouse initiative, how clear is the scope, really?
     If we intend to design Data Marts, and we have no specified need for a data warehouse either to become a system of record or to support Master Data Management (MDM), then we may choose Dr. Ralph Kimball's Data Warehouse Bus architecture, designing a library of conformed (standardized, re-usable) dimension and fact tables for deployment into a series of purpose-built data marts. Under these requirements, we may have no specific need for an Inmon-style third-normal-form (3NF) Enterprise Data Warehouse (EDW) in general, or for a Data Vault in particular.
     In other cases, however, because data warehouse data sometimes outlives its corresponding source data inside a soon-to-retire application database, a data warehouse may, like it or not, assume a system-of-record role for its data, as Bill Inmon reminds us. Whereas the Kimball Bus architecture's tables are often not related via key fields, and in fact may not be populated at all until deployment from the Bus into a specific-needs Data Mart, Kimball adherents rarely assert a system-of-record role for their solutions.
     But suppose we do determine that our required solution either needs to assume a system-of-record role, or must support Master Data Management. We may then elect to design a fully functional EDW, rather than Kimball's DW Bus, so that the EDW itself, and not just its dependent data marts, is a working, populated database. Now, knowing that the creation of a classic EDW, with its requirement for an up-front, enterprise-wide design, is a challenge with today's expectations for rapid delivery, some may be curious about new design methodologies that offer ways to accelerate EDW design. Data Vault, a data warehouse modeling method with a substantial following in Denmark, and a growing base in the U.S., offers specific and important benefits.
     In order to set expectations early about Data Vault, readers must understand that, somewhat unlike a traditional EDW, and utterly unlike a star schema, a Data Vault (not to be confused with Business Data Vault, which is not addressed in this article) cannot serve as an efficient presentation layer appropriate for direct queries. Rather, it is more like a historic enterprise data staging repository that, with additional downstream ETL, will support not only star schemas, reporting and data mining, but also master data management, data quality and other enterprise data initiatives.
  4. Data Vault Benefits:
     • Benefit #1: Allows for loading of a history-tracking DW with little or none of the typical extraction, transformation and loading (ETL) transformations that, once they are finally figured out, would otherwise contain subjective interpretations of the data, and which purportedly enhance the data and prepare it for reporting or analytics.
       o In my view, this is almost enough of a benefit all by itself. As such, in the introduction that follows, I will focus on proving this point.
       o Agile Win: Confidently loading a DW without having to already know the fine details of business rules and requirements, and the resulting transformation requirements, means that loading of historical and incremental data can get accomplished before the first target database design (3NF EDW or Data Mart) is complete.
     • Benefit #2: Insofar as Data Vault prescribes a very generic 'de-constructing' of OLTP tables, these de-constructing transformations can be automated, and so can the associated early-stage ETL into the Data Vault. Since, as you'll soon see, Data Vault causes a substantial increase in the number of tables, this automation potential is a substantial benefit.
       o Agile Win: Automated initial design and loading, anyone?
     • Benefit #3: Due to Data Vault's generic design logic, its use of surrogate keys (more on this soon), and its prescription to avoid subjective-interpretive transformations, it's reasonable to quickly load a Data Vault with just the needed subset of tables.
       o Agile Win: More frequent releases. Quickly design for, and load, only the data needed for the next release. Use the same generic design to load other tables when those User Stories from the Product Backlog get placed into a Sprint.
     In the remainder of this article, I will provide a high-level introduction to Data Vault, with primary emphasis on how it achieves Benefit #1.
  5. High-Level Introduction to Data Vault Methodology:
     We begin with a simple OLTP database design for clients purchasing products from a company's stores. For simplicity, I include only a minimum of fields. In the diagrams, 'BK' means business key, 'FK' means foreign key. Refer to Diagram A below.
     As is common, this simple OLTP schema does not use surrogate keys. If a client gets a new email address, or a product gets a new name, or a city's re-mapping of boundary lines suddenly places an existing store in a new city, new values would overwrite the old values, which would then be lost. Of course, in order to preserve history, history-tracking surrogate keys are commonly used by practitioners of both Bill Inmon's classic third-normal-form (3NF) EDW design and Dr. Ralph Kimball's Star Schema method, but both of these methods prescribe surrogate keys within the context of data transformations that also include subjective interpretation (herein simply 'subjective transformation') in order to cleanse or purportedly enhance the data for the purposes of integration, reporting, or analytics. Data Vault purists claim that any such subjective transformation of line-of-business data introduces inappropriate distortion to it, thereby disqualifying the Data Warehouse as system of record.
     Data Vault, importantly, provides a unique way to track historical changes in source data while eliminating most, or all, subjective transformations such as field renaming, selective data-quality filters, establishment of hierarchies, calculated fields, and target values. Analytics-driven, subjective transformations can still be applied, but they are applied downstream of the Data Vault EDW, as subsequent transformations for loads into data marts designed to analyze specific processes. Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will now describe. Before beginning, I recommend against too quickly comparing this method with others, like star-schema design, which serve different needs.
  6. Diagram A: Simple OLTP schema (data source for a Data Vault)
  7. Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. The diagram's Client table is a good example. Hubs work according to the following simplified description:
     Hub Tables:
     • Define the granularity of an entity (e.g., product), and thus the granularity of non-key attributes (e.g., product description) within the entity.
     • Contain a new surrogate primary key (PK), as well as the source table's business key, which is demoted from its PK role.
     Satellite Tables:
     • Contain all non-key fields (attributes), plus a set of date-stamp fields.
     • Contain, as a Foreign Key (FK), the Hub's PK, plus the load date-time stamps.
     • Have a defining, dependent entity relationship to one, and only one, parent table.
     • Whether that parent table is a Hub or a Link, the Satellite holds the non-key fields from the parent table.
     • Although on initial loads only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes (e.g., a client's email address) upstream in the OLTP schema (often accomplished up there with a simple over-write), a new row is added only to the Satellite, and not the Hub, which is why many Satellite rows relate to one Hub row. So, in this fashion, historic changes within source tables are gracefully tracked in the EDW.
     Notice in Diagram B that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the tables, the new location of existing fields, and the various added date-time stamps.
  8. Diagram B: Hubs and Satellites in a partially-designed Data Vault schema
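     To make the Hub/Satellite split concrete, here is a minimal DDL sketch of the Client pair shown in Diagram B. The table and column names follow the diagrams; the datatypes, the Record_Source column, and the Load_End_Date convention are illustrative assumptions, not prescriptions of the method.

         -- Minimal sketch of the Client Hub/Satellite pair (T-SQL; datatypes
         -- and audit columns assumed for illustration).
         CREATE TABLE dbo.Client_h (
             Client_SK     INT IDENTITY(1,1) PRIMARY KEY, -- new surrogate PK
             Client_BK     VARCHAR(20) NOT NULL,          -- business key, demoted from its PK role
             Load_Date     DATETIME    NOT NULL,
             Record_Source VARCHAR(50) NOT NULL           -- assumed audit column
         );

         CREATE TABLE dbo.Client_h_s (
             Client_SK     INT          NOT NULL
                 REFERENCES dbo.Client_h (Client_SK),     -- FK to the one parent Hub
             Load_Date     DATETIME     NOT NULL,         -- when this version arrived
             Load_End_Date DATETIME     NULL,             -- NULL marks the current row
             FirstName     VARCHAR(50)  NULL,             -- non-key attributes moved here
             LastName      VARCHAR(50)  NULL,             --   from the OLTP Client table
             Email         VARCHAR(100) NULL,
             PRIMARY KEY (Client_SK, Load_Date)           -- many Satellite rows per Hub row
         );

     A changed email address inserts a new Client_h_s row (and end-dates the prior one); the Client_h row itself never changes.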
  9. Link Tables:
     • Refer to Diagram C.
     • Relate exactly two Hub tables together.
     • Contain, now as non-key values, the primary keys of the two Hubs, plus the Link's own surrogate PK.
     • As with an ordinary association table, a Link is a child to two other tables and, as such, is able to gracefully handle relative changes in cardinality between the two tables and, where necessary, can directly resolve many-to-many relationships that might otherwise cause a show-stopper error in the data-loading process.
     • Unlike an ordinary association table, the Link table, with its own surrogate PK, is able to track historic changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that conforms to the initial cardinality between tables shares the same Link-table surrogate key; if an unexpected future source-data change causes a cardinality reversal (so that the one becomes the many, and vice versa), a new row, with a new surrogate key, is generated to capture it, while the original surrogate key preserves the historical relationship. Slick!
     • In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables, too. As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that, in coordination with its surrogate PK, an Order (perhaps for a long-running service) that, after the Order Date, gets re-credited to a different store can be efficiently tracked over time.
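     Continuing the sketch under the same assumptions, the Order_Store_l Link from the example above might look like this; it presumes Order_h and Store_h Hubs built like Client_h earlier.

         -- Minimal sketch of the Order/Store Link (T-SQL; datatypes assumed).
         CREATE TABLE dbo.Order_Store_l (
             Order_Store_SK INT IDENTITY(1,1) PRIMARY KEY, -- the Link's own surrogate PK
             Order_SK       INT      NOT NULL,             -- Order_h PK, held as a non-key value
             Store_SK       INT      NOT NULL,             -- Store_h PK, held as a non-key value
             Load_Date      DATETIME NOT NULL
         );
         -- If an order is later re-credited to a different store, we insert a new
         -- row with a new surrogate key; the old row preserves the historical
         -- relationship, so no update ever destroys information.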
  10. Diagram C: Completed Data Vault Schema (Link tables added)
  11. Now we've added Link tables. After scanning Diagram C, go back and compare it with Diagram A, and note the movement of the various non-key attributes. Undoubtedly, you will also notice, and may be concerned, that the source schema's five tables just morphed into the Data Vault's twelve.
      Importantly, note that Diagram A's Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table. When you consider that an order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty of vital associated data), then it makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to relate the Orders_h and Products_h tables, of course also needs a Details_l_s Satellite table to hold the vital non-key attributes, Quantity and Unit Price.
      The Data Vault method does allow for some interpretation here. You might now be thinking, "Aha! So, we haven't eliminated all subjective interpretation!" Perhaps not, but what I describe here is a pretty small, generic interpretation. Either way, in this situation, it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite), rather than the Details_l Link. Indeed, if we used very simple Data Vault design-automation logic, which simply de-constructs all tables into Hub and Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then have to create not one but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link table, to connect our tables, and these Link tables would contain no attributes of apparent value. Therefore, we choose the design that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, but that's beyond the scope of this article.
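      Lastly, under the same illustrative assumptions as the earlier sketches, the chosen Details design from Diagram C: a Link relating the two Hubs, with a Satellite carrying the line-item attributes.

          -- Sketch of the Details Link and its Satellite (T-SQL; datatypes assumed;
          -- presumes Orders_h and Products_h Hubs built like Client_h earlier).
          CREATE TABLE dbo.Details_l (
              Details_SK INT IDENTITY(1,1) PRIMARY KEY, -- Link's own surrogate PK
              Order_SK   INT      NOT NULL,             -- Orders_h PK
              Product_SK INT      NOT NULL,             -- Products_h PK
              Load_Date  DATETIME NOT NULL
          );

          CREATE TABLE dbo.Details_l_s (
              Details_SK INT           NOT NULL
                  REFERENCES dbo.Details_l (Details_SK),
              Load_Date  DATETIME      NOT NULL,
              Quantity   INT           NULL,            -- the non-key attributes that
              UnitPrice  DECIMAL(10,2) NULL,            --   justify a Satellite here
              PRIMARY KEY (Details_SK, Load_Date)
          );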
  12. Conclusion: Our discussion of Data Vault opened with the idea that an EDW should load and store historical data without applying any transformations that contain subjective interpretation of data or business rules, because those interpretations, even if appropriate for specific reporting or analytics, do modify line-of-business data, and therefore introduce distortions into operational data. Those interpretive transformations should occur downstream, during ETL into presentation-layer tables. Although Data Vault does, in fact, apply a specific set of generic 'de-construction' transformations, these transformations contain little or no subjective interpretation of business rules. They do, however, allow it to (1) apply an appropriate level of referential integrity to source data even where the source system may lack it now or in the future; (2) gracefully capture historical data changes, within and between tables, without endangering the success of the data load; and (3) support loading of data from a subset of source tables initially, and then load, or not load, other related source tables much later without compromising the EDW's referential integrity. Lastly, and very importantly, (4) Data Vault design and the associated loading ETL, which is largely generic from one data set to another, can be automated, and thus radically accelerated in development. Although the logic of this automation flows from the simplicity of Data Vault design, a detailed automation discussion is beyond the scope of this article.
      In closing, if we can automatically design and load a Data Warehouse (albeit not its presentation layer), it frees up brain cells for the higher-order logic of designing the presentation layer and the intensive, custom ETL to load it. As I described here, all of this can be accomplished simultaneously.
      ________________________________________________
      daniel upton
      dupton@decisionlab.net
      DecisionLab.Net | business intelligence is business performance
  13. DecisionLab.Net Range of Services:
      Business Intelligence Roadmapping, Feasibility Analysis
      BI Project Estimation and Requirement Modelstorming
      BI Staff Augmentation: Data Warehouse / Mart / Dashboard Design and Development
      Daniel Upton | DecisionLab | http://www.decisionlab.net | dupton@decisionlab.net | Direct 760.525.3268 | http://blog.decisionlab.net | Carlsbad, California, USA
