Data Vault: What is it? Where does it fit? SQL Saturday #249


  1. 1. DecisionLab.Net business intelligence is business performance Data Vault: What is it? Where does it fit? - SQL Saturday #249 DecisionLab.Net http://www.decisionlab.net http://blog.decisionlab.net dupton@decisionlab.net Carlsbad, California, USA
  2. 2. Data Vault: What is it? Where does it fit? - SQL Saturday #249 daniel upton, business intelligence analyst, certified scrum master DecisionLab.Net business intelligence is business performance Reference: Database diagrams in this presentation are adaptations or expansions of those published in the article “Data Warehouse Generation Algorithm Explained”, available at http://www.dwhautomation.org/data-warehouse-generation-algorithm-explained/
  3. 3. Real-World BI-DW Implementation Questions:
  - As an ETL developer who often waits weeks or months, capturing no source data history, while a new DW / DM target data model gets designed, would you instead like to quickly, and with some automation, define and load a robust, history-tracking DW / staging repository, without compromising the star/snowflake schema design?
  - As a DW architect, have recent requests from project sponsors that your DW reporting / analytics (BI) solution also supply authoritative data into upcoming Master Data Management or Enterprise Data Quality solutions made you nervous about your planned or (oops) already completed data transformations, which were supposed to be scope-limited to BI only? When you remind the sponsors of the initially agreed scope, they shrug!
  - After six months of production data loads from a source system that does not track historic data changes, you discover that your DW-loading logic is wrong, and of course your staging area is overwritten with each cycle.
  4. 4. Data Vault resolves each of the above challenges. This session will demonstrate that claim, familiarize you with Data Vault design fundamentals, briefly explore its potential for automation, and consider where it fits.
  5. 5. List of Entering Assumptions:
  - Disk storage is sufficiently cheap.
  - Automation of back-end DW development tasks is appealing.
  - Source data is in an RDBMS with at least a resemblance to 3rd normal form.
  - Source data exists in disparate systems, OR in one system with poor data quality, OR with the inability to efficiently track historic data changes.
  - A non-volatile back-end data repository between operational systems and the BI-DW presentation layer (e.g. star schema) is desired.
  - Time-latency requirements give us an ample time window to load the repository and then transform data from there into the presentation layer.
  - A proliferation, that is, a substantial increase in the number, of tables is tolerable as long as both the design and the loading of the schema are straightforward and, to some extent, automatable.
  6. 6. High-Level Introduction to Data Vault Methodology: We begin with a simple OLTP database design for sales transactions, plus a small excerpt of tables from an ERP and a CRM schema. For illustration purposes, I include a minimum of tables and fields. In the diagrams, ‘BK’ means business key and ‘FK’ means foreign key. Refer to Diagram A below. This OLTP schema uses no surrogate keys. If a client gets a new email address, or a product gets a new name, or a city’s remapping of boundary lines suddenly places an existing store in a new city, then for any given business key, new non-key values overwrite old values, which are thereby lost. Of course, in order to preserve history, history-tracking surrogate keys are commonly used by practitioners of both W.H. Inmon’s classic third-normal-form (3NF) EDW design and Dr. Ralph Kimball’s star schema method, but both methods prescribe surrogate keys within the context of data transformations that also include subjective interpretation (herein simply ‘subjective transformation’) to cleanse or enhance the data for integration, in order to serve reporting or analytic needs. Data Vault purists claim that any such subjective transformation of line-of-business data distorts it, thereby disqualifying the data warehouse / mart as a system of record. Data Vault, by contrast, provides a simple yet unique way to track historical changes from source data while eliminating most, or all, subjective transformations such as data-quality filters, establishment of hierarchies, calculated fields, or target/goal values. Analytics-driven, subjective transformations should still be applied for BI, but they are applied downstream of the Data Vault EDW, as subsequent custom transformations loading data marts designed to analyze specific business processes. Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will now describe.
Before beginning, I recommend against too quickly comparing this method to others, like star-schema design, which serve different needs.
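The overwrite problem described above can be sketched in a few lines. This is a minimal illustration using Python's sqlite3 with hypothetical table and column names (Client, ClientBK, Email are stand-ins, not from the diagrams): in an OLTP table keyed directly on the business key, an in-place update destroys the prior value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A simple OLTP Client table keyed directly on the business key (no surrogate key).
cur.execute("CREATE TABLE Client (ClientBK TEXT PRIMARY KEY, Email TEXT)")
cur.execute("INSERT INTO Client VALUES ('C-1001', 'old@example.com')")

# The client changes email; the OLTP system overwrites the value in place.
cur.execute("UPDATE Client SET Email = 'new@example.com' WHERE ClientBK = 'C-1001'")

# The old address is gone: still one row, holding only the new value.
rows = cur.execute("SELECT Email FROM Client WHERE ClientBK = 'C-1001'").fetchall()
print(rows)  # [('new@example.com',)]
```

This is exactly the history loss that the hub/satellite deconstruction in the following slides is designed to prevent.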
  7. 7. Diagram A: Excerpts from three operational OLTP schemas (data sources for Data Vault)
  8. 8. Diagram B: Sales Transaction Only
  9. 9. Diagram C: Hubs and Satellites in Source B’s partially-designed Data Vault schema
  10. 10. Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. Let’s use Diagram B’s Client table as our example. Hubs and Satellites have the following characteristics:
  Hub Tables:
  - Define the granularity of an entity (e.g. Client), and thus the granularity of the entity’s non-key attributes (e.g. name, description).
  - Contain a new surrogate primary key (PK), as well as the source table’s business key, demoted from its PK role.
  - Contain no non-key attribute fields such as name, address, email, or telephone.
  Satellite Tables:
  - Contain all non-key fields (attributes), plus a set of date-stamp fields.
  - Contain, as a foreign key (FK), the Hub’s PK, plus load date-time stamps.
  - Have a defining, dependent entity relationship to one, and only one, parent table. Whether that parent is a Hub or a Link, the Satellite holds the non-key attribute fields from the parent table.
  Although on the initial load only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes upstream in the OLTP schema (e.g. a client’s email address changes, too often accomplished with an overwrite), a new row is added only to the Satellite, not to the Hub, which is why many Satellite rows will relate to one Hub row. In this fashion, historic changes within source tables are gracefully tracked in the Data Vault. Notice in Diagram C that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the tables, the new location of existing fields, and the various added date-time stamps.
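The hub-plus-satellite split above can be sketched as follows. This is a simplified sketch in Python's sqlite3; the column names (Client_SK, ClientBK, LoadDate, Email) are illustrative assumptions, and the real tables in Diagram C carry additional date stamps. The point to notice: a changed email becomes a new satellite row, while the hub row is untouched.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hub: new surrogate PK plus the demoted business key; no non-key attributes.
cur.execute("""CREATE TABLE Client_h (
    Client_SK INTEGER PRIMARY KEY,
    ClientBK  TEXT NOT NULL,
    LoadDate  TEXT NOT NULL)""")

# Satellite: the hub's PK as FK, all non-key attributes, load date stamps.
cur.execute("""CREATE TABLE Client_h_s (
    Client_SK INTEGER NOT NULL REFERENCES Client_h (Client_SK),
    LoadDate  TEXT NOT NULL,
    Email     TEXT,
    PRIMARY KEY (Client_SK, LoadDate))""")

first = datetime(2013, 9, 1).isoformat()
cur.execute("INSERT INTO Client_h VALUES (1, 'C-1001', ?)", (first,))
cur.execute("INSERT INTO Client_h_s VALUES (1, ?, 'old@example.com')", (first,))

# Later, the source overwrites the email; the vault just adds a satellite row.
later = datetime(2013, 10, 1).isoformat()
cur.execute("INSERT INTO Client_h_s VALUES (1, ?, 'new@example.com')", (later,))

hub_rows = cur.execute("SELECT COUNT(*) FROM Client_h").fetchone()[0]
sat_rows = cur.execute("SELECT COUNT(*) FROM Client_h_s").fetchone()[0]
print(hub_rows, sat_rows)  # 1 2
```

Both the old and new email survive, each stamped with its load date, so history is recoverable by querying the satellite.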
  11. 11. Diagram D: Data Vault Schema w/ Link tables added: Complete for Source B only
  12. 12. Link Tables: See Diagram D.
  - Links relate exactly two Hubs together.
  - Links contain, now as non-key values, the primary keys of the two Hubs, plus their own surrogate PK.
  - Peg-leg links are special links in that they relate to only one Hub. More on this later.
  As with an ordinary association / join table, a Link is a child to both of the two Hubs it relates and, as such, it gracefully handles odd relative changes in cardinality between the two tables and cleanly supports many-to-many relationships stored in the source system, which otherwise either cause load-failing errors in the data-loading process or require ad-hoc, hard-coded data cleansing. Unlike an ordinary association table, the Link table, with its own surrogate PK in conjunction with the date-stamp fields in both Hubs, allows us to track historic changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that conformed with the initial cardinality between the tables shares the same Link table surrogate key, but if an unexpected future source data change causes a cardinality reversal (so that the one becomes the many, and vice versa), a new row, with a new surrogate key, is generated to capture it, while the original surrogate key preserves the historical relationship. Slick!
  Limits to Automated Logic for Data Vault Design: Note that the OLTP Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table, which seems valid insofar as an Order Detail can be considered simply a direct relationship between an Order and a Product. This logic may or may not be fully automatable. In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables, too.
As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that an Order (perhaps for a long-running service) that, after the Order Date, gets re-credited to a different store can be efficiently tracked over time, in coordination with the Link’s surrogate PK. Of course, with this last enhancement, we’re probably crossing the line from ‘automatable’ to ‘custom’ Data Vault design.
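The Order_Store_l example just described can be sketched like this (again sqlite3; the surrogate-key values, LoadDate column, and business keys are illustrative assumptions). Re-crediting the order to a different store adds a new link row with a new surrogate key, while the original row preserves the historical relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE Order_h (Order_SK INTEGER PRIMARY KEY, OrderBK TEXT)")
cur.execute("CREATE TABLE Store_h (Store_SK INTEGER PRIMARY KEY, StoreBK TEXT)")

# Link: its own surrogate PK, plus the two hub keys now held as non-key FKs,
# with a date stamp (the optional enhancement discussed above).
cur.execute("""CREATE TABLE Order_Store_l (
    Order_Store_SK INTEGER PRIMARY KEY,
    Order_SK INTEGER REFERENCES Order_h (Order_SK),
    Store_SK INTEGER REFERENCES Store_h (Store_SK),
    LoadDate TEXT)""")

cur.execute("INSERT INTO Order_h VALUES (1, 'O-500')")
cur.executemany("INSERT INTO Store_h VALUES (?, ?)", [(1, 'S-10'), (2, 'S-20')])

# Initial load: order O-500 credited to store S-10.
cur.execute("INSERT INTO Order_Store_l VALUES (1, 1, 1, '2013-09-01')")

# Later the order is re-credited to S-20: a NEW link row with a new surrogate
# key is inserted; the original row is never updated or deleted.
cur.execute("INSERT INTO Order_Store_l VALUES (2, 1, 2, '2013-10-01')")

history = cur.execute(
    "SELECT Store_SK FROM Order_Store_l WHERE Order_SK = 1 ORDER BY LoadDate"
).fetchall()
print(history)  # [(1,), (2,)]
```

Querying the link by order yields the full store-assignment history in load order.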
  13. 13. Summary: Steps in Basic Data Vault Automation Logic:
  A. Buy or build an application to automate Data Vault schema design and/or Data Vault ETL code.
  B. Based on each OLTP table’s primary and foreign key structure, auto-tag each table as Hub, Satellite, or Link.
  C. Human review, overruling certain automated quick-tags with enhancements.
  D. Using either custom-built logic or a purchased design-automation application, auto-generate the Data Vault DDL.
  E. Using the same, auto-generate the ETL code for loading the Data Vault.
  A: Buy or Build:
  - Build: Roll your own with macros in ERwin, ER Studio, etc.
  - Buy: Consider BIReady, QUIPU, WhereScape Red.
  - No silver bullets.
  14. 14. B. Auto-Tag: Hub, Satellite, or Link. Notice how simple (and thus automatable) these rules are.
  Rules: a bottom-up process of identifying ‘non-Hubs’:
  - Satellite auto-tag: A table that no other table’s foreign keys reference, whose own foreign-key field also acts as its primary key, and whose primary key contains no other fields.
  - Peg-leg Link auto-tag: Exactly as above, except that the primary key contains one or more additional fields: a candidate to become a peg-leg link.
  - Link auto-tag: A table that no other table’s foreign keys reference, and which has more than one foreign key, with all foreign keys collectively contained within the primary key. The primary key may include other fields as well. As a reminder, many, perhaps most, Links are created not directly from individual source tables, but rather from the direct (existing or to-be-designed) relationships between source tables. In this case it does not matter whether the primary key is wider than all the foreign keys together.
  - Hub auto-tag: A table that does not fit one of the above rules is a Hub.
  C. Human Review: Overrule certain automated quick-tags with enhancements based on experience with the database, the data, and the business. See the above section on ‘Limits to Automated Logic for Data Vault Design’.
  D. Generate DDL code.
  E. Generate logic for ETL code.
  F. Capture the ETL code and set up scheduled Data Vault loads.
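The auto-tag rules above are simple enough to express directly in code. Below is a minimal sketch (the function name, parameters, and return labels are my own, not from any Data Vault tool); real design-automation products also inspect data content and naming conventions before tagging.

```python
def auto_tag(pk_fields, fk_fields, referenced_by_others):
    """Quick-tag one source table per the rules above.

    pk_fields / fk_fields: sets of column names in the table's primary key
    and foreign keys; referenced_by_others: True if any other table's FK
    points at this table (which disqualifies all of the non-Hub tags).
    """
    pk, fk = set(pk_fields), set(fk_fields)
    if referenced_by_others or not fk:
        return "Hub"                 # fallback rule: fits no non-Hub pattern
    if fk <= pk:                     # all FK columns live inside the PK
        if len(fk) >= 2:
            return "Link"            # multiple FKs contained within the PK
        if pk == fk:
            return "Satellite"       # PK is exactly its single FK
        return "Peg-Leg Link"        # single FK, PK has additional fields
    return "Hub"

# Examples (hypothetical key structures):
print(auto_tag({"client_id"}, {"client_id"}, False))                     # Satellite
print(auto_tag({"client_id", "seq"}, {"client_id"}, False))              # Peg-Leg Link
print(auto_tag({"order_id", "product_id"},
               {"order_id", "product_id"}, False))                       # Link
print(auto_tag({"order_id"}, {"client_id", "store_id"}, True))           # Hub
```

Step C (human review) then overrides these quick-tags where experience with the data says otherwise.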
  15. 15. Diagram E: Data Vault Schema: Complete for (excerpts of) Sources A through C
  16. 16. Diagram F: Data Vault Schema - Integrated Data Vault Spanning Multiple Operational Databases
  17. 17. Diagram G: Summary of OLTP into Data Vault
  18. 18. In Diagram G, we note that the source schemas’ seven tables morphed into the Data Vault’s eighteen. When you consider that an order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty of vital associated data), it makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to relate the Orders_h and Products_h tables, of course also needs a Details_l_s Satellite table to hold the show-stopper non-key attributes, Quantity and Unit Price. The Data Vault method does allow for some interpretation here. You might now be thinking, “Aha! So we haven’t eliminated all subjective interpretation!” Perhaps not, but what I describe here is a pretty small, generic interpretation. Either way, in this situation it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite) rather than the Details_l Link. Indeed, if we used very simple Data Vault design-automation logic that simply deconstructs all tables into Hub-and-Satellite pairs, this is what we would get. Keep in mind, however, that if we did that, we would then have to create not one but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link table, to connect our tables, and these Link tables would contain no attributes of apparent value. Therefore, we choose the design that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, but that’s beyond the scope of this article.
  19. 19. Diagram H: Data Vault Supplies Authoritative Enterprise Data To Multiple Target Applications (custom transformations with subjective data re-interpretation)
  20. 20. Diagram I: Data Vault Supplies Authoritative Enterprise Data To Multiple Target Applications. “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework” - W.H. Inmon. What about the 3NF EDW? Data Vault: Staging or EDW?
  21. 21. Diagram J: Data Vault Supplies Authoritative Enterprise Data To Multiple Target Applications. What else? If Data Vault provides an authoritative, non-volatile, history-tracking RDBMS repository from multiple operational systems, with robust yet forgiving referential integrity, while imposing little or no subjective data re-interpretation, then it has benefits for target systems beyond BI-DW:
  - Data Quality Management: Data Vault gracefully and permanently stores the good, the bad, and the ugly (outliers, RI violations, etc.) and their improvement (or lack thereof) over time.
  - Master Data Management: A more appropriate MDM data source than dimensional data marts. In closed-loop MDM, Data Vault feeds operational data to MDM, which then publishes improved master data back to the operational systems, which continue to automatically feed the Data Vault, so the Data Vault captures MDM adoption levels across multiple operational systems.
  22. 22. Review: Real-World Questions and Entering Assumptions:
  Real-World BI-DW Implementation Questions:
  - An ETL developer waits, capturing no source data history, while the data model gets designed. Instead, define and load a robust, history-tracking DW / staging repository?
  - Requests that the BI-DW solution supply authoritative data into an MDM or DQ solution. What about ETL that was supposed to be BI-only?
  - Six months post-release, you discover that your DW-loading logic is wrong and significant source data has been overwritten.
  23. 23. Entering Assumptions:
  - Disk storage is sufficiently cheap.
  - Automation of back-end DW development tasks is appealing.
  - Source data is in an RDBMS with at least a resemblance to 3rd normal form.
  - Source data exists in disparate systems, OR in one system with poor data quality, OR with the inability to efficiently track historic data changes.
  - A non-volatile back-end data repository between operational systems and the BI-DW presentation layer (e.g. star schema) is desired.
  - Time-latency requirements give us an ample time window to load the repository and then transform data from there into the presentation layer.
  - A proliferation, that is, a substantial increase in the number, of tables is tolerable as long as both the design and the loading of the schema are straightforward and, to some extent, automatable.
  Conclusion: Data Vault can indeed resolve these challenges.
  24. 24. Thank you! DecisionLab.Net business intelligence is business performance daniel upton, business intelligence analyst, certified scrum master dupton@decisionlab.net http://www.linkedin.com/in/DanielUpton http://www.slideshare.net/DanielUpton
