EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel Upton


How to adeptly model a lean data warehouse for maximum adaptability to a changing business, changing source data, changing business rules, changing requirements, and changing needs for integration with NoSQL repositories.

Published in: Data & Analytics

  1. EDW Data Model Storming for Integration of NoSQL with RDBMS
     SQL Saturday #497, April 2, 2016
     Daniel Upton, DW-BI Architect, Data Modeler
     DecisionLab.Net, serving Orange County and San Diego County since 2007
     blog:
  2. Open Questions
     o With DW-BI now a mainstream I.T. career specialization with an established set of best practices, why do many real-world implementations still fall short of satisfying business stakeholder expectations?
     o What influence have Lean and Agile thinking had on DW-BI?
     o What parts of DW implementation have been most resistant to Agile?
     o Are established DW data modeling methods an asset or a liability?
     o What factors are driving change in data modeling for business intelligence?
     o What is Data Model Storming?
     o What challenges does NoSQL introduce to data modeling intended for integration with RDBMS data?
     o What do we mean by Integration?
     o What does End-to-End Model Storming mean?
     Objectives:
     o Describe a data modeling method and demonstrate how it differs from both dimensional modeling and 3rd Normal Form according to…
        o Agile: Quickly and iteratively deliver minimally viable products (MVPs) to users.
        o Lean: Design in loose coupling to minimize or eliminate functional dependencies.
        o PMBOK: Break down work (including design) into small-yet-cohesive chunks.
     o Review BEAM Dimensional Model Storming (Corr and Stagnitto).
     o Demonstrate some best-practice NoSQL data models as major variations from 3rd Normal Form.
     o Introduce and perform EDW Model Storming with a simple use case involving unpredictable, last-minute changes to business rules.
     o Extend the Model Storm w/ a last-minute requirement for NoSQL integration.
  3. Traditional Data Modeling Methods
     o 3rd Normal Form: OLTP and EDW
     o Dimensional Warehouse / Mart: Star Schema w/ Facts and Dimensions
  4. 3rd Normal OLTP Source
     Data Vault (aliases: Lean DW, Hyper-Normal Model)
     o One Hub and all of its dependent Satellites are known as an Ensemble, a stand-alone set of tables that always has zero functional dependencies on other Ensembles.
     o Hubs store business keys (unique identifiers well known to non-techies and enterprise-wide).
     o Satellites store and historize all attribute fields.
  5. o Links store all relationships as associations.
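
The Hub, Satellite, and Link roles described above can be sketched as tables. This is a minimal illustration using Python's built-in sqlite3 module, not the deck's own model: every table and column name (Hub_Customer, Customer_Number, Hub_Product, and so on) is a hypothetical example chosen for this sketch.

```python
import sqlite3

# All names below are hypothetical; they only illustrate the three table roles.
DDL = """
CREATE TABLE Hub_Customer (
    Hub_Customer_SQN INTEGER PRIMARY KEY,   -- surrogate key
    Customer_Number  TEXT NOT NULL UNIQUE,  -- business key
    DW_Load_DTS      TEXT NOT NULL
);
CREATE TABLE Hub_Product (
    Hub_Product_SQN INTEGER PRIMARY KEY,
    Product_Code    TEXT NOT NULL UNIQUE,
    DW_Load_DTS     TEXT NOT NULL
);
-- Satellite: attribute fields plus their history, keyed by Hub key + load time.
CREATE TABLE Sat_Customer (
    Hub_Customer_SQN   INTEGER NOT NULL REFERENCES Hub_Customer (Hub_Customer_SQN),
    DW_Load_DTS        TEXT NOT NULL,
    DW_Load_Expire_DTS TEXT,
    Customer_Name      TEXT,
    City               TEXT,
    PRIMARY KEY (Hub_Customer_SQN, DW_Load_DTS)
);
-- Link: the relationship itself, stored as an association between Hubs.
CREATE TABLE Link_Customer_Product (
    Link_Customer_Product_SQN INTEGER PRIMARY KEY,
    Hub_Customer_SQN INTEGER NOT NULL REFERENCES Hub_Customer (Hub_Customer_SQN),
    Hub_Product_SQN  INTEGER NOT NULL REFERENCES Hub_Product (Hub_Product_SQN),
    DW_Load_DTS      TEXT NOT NULL,
    UNIQUE (Hub_Customer_SQN, Hub_Product_SQN)
);
"""

def build_vault():
    """Create the example tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(DDL)
    return conn

conn = build_vault()
conn.execute("INSERT INTO Hub_Customer VALUES (1, 'C-100', '2016-04-02')")
conn.execute("INSERT INTO Hub_Product VALUES (1, 'P-7', '2016-04-02')")
conn.execute("INSERT INTO Link_Customer_Product VALUES (1, 1, 1, '2016-04-02')")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Note that the two Hubs never reference each other; only the Link knows about both, which is what keeps each Ensemble free of dependencies on the others.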
  6. BEAM Model Storming (Corr and Stagnitto)
     o Accelerates agile dimensional design with a concise short-hand notation on eye-friendly visual displays, so that dimensional design happens in real time during requirements meetings with business stakeholders.
     o Begins with a user information story.
     o Ends with artifacts that capture the business requirement while also specifying the logic for a star schema.
     o One such artifact is an event matrix (minimal example):
     o Includes source data profiling at the column/record level; ignores source data structure.
  7. Best-Practice NoSQL (Wide-Table, No Joins) Data Model: Why not in 3rd Normal Form?
     o Fields duplicated and / or pivoted to balance join-minimization with redundant storage.
     o Just an example, not to be integrated in our example...
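
To make "duplicated and / or pivoted" concrete, here is a small Python sketch. The customer and order data and all field names are invented for illustration; they are not the model shown on the slide.

```python
# Hypothetical normalized source data (3rd Normal Form style).
customers = {1: {"name": "Acme", "city": "Carlsbad"}}
orders = [
    {"order_id": 10, "customer_id": 1, "product": "Widget", "qty": 2},
    {"order_id": 11, "customer_id": 1, "product": "Gadget", "qty": 1},
]

def to_wide_rows(customers, orders):
    """Duplicate customer fields onto every order row so reads need no join."""
    wide = []
    for order in orders:
        customer = customers[order["customer_id"]]
        wide.append({
            **order,
            "customer_name": customer["name"],
            "customer_city": customer["city"],
        })
    return wide

def pivot_by_customer(wide_rows):
    """Pivot the wide rows into one document per customer with embedded orders."""
    docs = {}
    for row in wide_rows:
        doc = docs.setdefault(row["customer_id"], {
            "customer_id": row["customer_id"],
            "customer_name": row["customer_name"],
            "customer_city": row["customer_city"],
            "orders": [],
        })
        doc["orders"].append(
            {"order_id": row["order_id"], "product": row["product"], "qty": row["qty"]})
    return list(docs.values())

wide_rows = to_wide_rows(customers, orders)
documents = pivot_by_customer(wide_rows)
```

The wide rows repeat the customer's name and city on every order (redundant storage), while the pivoted documents embed the orders inside the customer; both shapes avoid a join at read time.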
  8. More on Lean Data Warehouse (Hyper-Normal / Data Vault): Objectives
     o Fully enforced, simple (single-field equi-joins only) referential integrity.
     o Identify a business key and store its values as unique records in a Hub table; a surrogate PK removes all functional dependencies (tight couplings) to this identifier FROM other tables’ FKs.
     o Store the history of value changes to all attributes in a child table using LoadDTS and LoadEndDTS.
     o Store all table relationships via an associative join table, to accommodate any current or future real-world cardinality (1-to-1, 1-to-M, & M-to-M).
     Why
     o While preserving all actual relationships between records in related tables, all DW table relationships are now abstracted as Hub_PK, related to Link_FK, related to Hub_PK.
     o For a Satellite’s identifier fields that, in source, were used as foreign keys (thus tightly coupled), remove these functional dependencies TO other DV Ensembles.
     o Benefits:
        o Zero functional dependencies between DW Ensembles, thus small increments may be designed, loaded, and released based only on the definition of a Minimally Viable Product (MVP), rather than forcing larger, slower releases or more functionally intertwined, thus much larger, increments.
        o When a directly related data subject area is later added, this is accomplished with zero refactoring of the existing Ensembles.
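
The history-keeping objective (LoadDTS and LoadEndDTS) can be sketched as follows. This is one possible approach under stated assumptions, not the deck's ETL: the table Sat_Customer, the single City attribute, and the helper record_change are hypothetical, and a NULL LoadEndDTS is assumed to mark the current version of a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sat_Customer (
        Hub_Customer_SQN INTEGER NOT NULL,
        LoadDTS          TEXT NOT NULL,
        LoadEndDTS       TEXT,            -- NULL marks the current version (assumption)
        City             TEXT,
        PRIMARY KEY (Hub_Customer_SQN, LoadDTS)
    )
""")

def record_change(conn, hub_sqn, city, load_dts):
    """Insert a new version only when the attribute value actually changed."""
    current = conn.execute(
        "SELECT City FROM Sat_Customer "
        "WHERE Hub_Customer_SQN = ? AND LoadEndDTS IS NULL",
        (hub_sqn,),
    ).fetchone()
    if current is not None and current[0] == city:
        return False  # nothing changed, so no new history row
    # End-date the current version (if any), then insert the new one.
    conn.execute(
        "UPDATE Sat_Customer SET LoadEndDTS = ? "
        "WHERE Hub_Customer_SQN = ? AND LoadEndDTS IS NULL",
        (load_dts, hub_sqn),
    )
    conn.execute(
        "INSERT INTO Sat_Customer VALUES (?, ?, NULL, ?)",
        (hub_sqn, load_dts, city),
    )
    return True

record_change(conn, 1, "San Diego", "2016-01-01")
record_change(conn, 1, "San Diego", "2016-02-01")  # same value: ignored
record_change(conn, 1, "Carlsbad", "2016-03-01")   # new value: old row end-dated
history = conn.execute(
    "SELECT LoadDTS, LoadEndDTS, City FROM Sat_Customer ORDER BY LoadDTS"
).fetchall()
```

The source system may have overwritten "San Diego" with "Carlsbad", but the Satellite keeps both versions, each with the period during which it was current.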
  9. Mindset for Lean DW ModelStorm Design:
     o K.I.S.S.: Once a source table is determined to be in scope, include all fields and records, so you never have to add them later.
     o Other than creating Hubs, Satellites, and Links, perform no other transformations in this layer.
     o No calculations, aggregations, or business rules (yet).
     o As such, we are NOT, or at least NOT YET, attempting to define a single version of the truth (SVOT), nor a data presentation / reporting RDBMS layer.
     o Instead, we are…
        o Loosely integrating data from multiple data sources
        o Aligning it around business keys
        o Tracking the history of attributes whose old values may be overwritten in source systems
        o Supporting all actual (intended and otherwise) relationships among records in related tables
        o Doing all of the above while enforcing simple referential integrity exclusively with single-field equi-joins
  10. DW ModelStorm Design Steps:
     o Begin where BEAM ModelStorming ends. From there…
     o Define Business Keys
     o Identify in-scope source tables
     o Reverse engineer in-scope tables into a data modeling tool
     o Identify and define the cardinality of physical and logical (non-instantiated) relationships
     o Classify each source table as a bona fide Entity or merely an Association
  11. [Diagram slide; no text captured]
  12. Now, group the source tables into distinct Subject Areas
     o Make copies of all above tables and place them into a new submodel
  13. Next, for each new table-copy…
     o Remove all (source-based) foreign key relationships without removing the underlying identifier fields.
     o Remove the primary key constraint.
     o Add the following control / metadata fields:
        ▪ DWLoadBatchID_SourceSys
        ▪ DW_Load_DTS
        ▪ DW_Load_Expire_DTS
        ▪ Placeholder_SurrogateKey (explained later)
     o Create a new composite Primary Key w/ Placeholder Surrogate Key + Load_DTS
     o Satellite-splitting
        ▪ If a subset of fields is updated in source much more frequently than the others, and the table will be large enough that ETL processing of the more frequent updates will result in excessive loading time, split the table into two or more subsets.
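
The per-table steps above (keep the identifier fields, drop the source PK and FK constraints, add control fields, create the composite key) can be sketched in Python. The control field names come from the slide; the data types, the helper table_copy_ddl, and the example table Copy_Customer with its columns are assumptions made for this sketch.

```python
import sqlite3

# Control / metadata fields named on the slide; the SQL types are assumptions.
CONTROL_FIELDS = [
    ("DWLoadBatchID_SourceSys", "INTEGER"),
    ("DW_Load_DTS", "TEXT NOT NULL"),
    ("DW_Load_Expire_DTS", "TEXT"),
    ("Placeholder_SurrogateKey", "INTEGER NOT NULL"),
]

def table_copy_ddl(table_name, source_columns):
    """Build DDL for a table-copy.

    source_columns is a list of (name, type) pairs taken from the source table.
    Identifier fields are kept as plain columns; no source PK or FK constraint
    is carried over. The new PK is Placeholder_SurrogateKey + DW_Load_DTS.
    """
    columns = [f"{name} {sql_type}"
               for name, sql_type in list(source_columns) + CONTROL_FIELDS]
    columns.append("PRIMARY KEY (Placeholder_SurrogateKey, DW_Load_DTS)")
    return f"CREATE TABLE {table_name} (" + ", ".join(columns) + ")"

# Hypothetical source table: CustomerID was its PK, RegionID was an FK.
source_columns = [("CustomerID", "INTEGER"), ("RegionID", "INTEGER"), ("Name", "TEXT")]
ddl = table_copy_ddl("Copy_Customer", source_columns)

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
info = conn.execute("PRAGMA table_info(Copy_Customer)").fetchall()
column_names = [row[1] for row in info]
pk_columns = [row[1] for row in sorted(info, key=lambda r: r[5]) if row[5] > 0]
```

Here CustomerID and RegionID survive as ordinary fields, so no information is lost, but nothing in the table-copy is constrained by another table any longer.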
  14. [Diagram slide; no text captured]
  15. Then, starting with tables classified earlier as bona fide entities
     ▪ In the new submodel, rename the Placeholder_SurrogateKey field to Hub[EntityName]_SQN (or …_HashId) for all tables split from the source entity table
     ▪ Copy one of these tables again
     ▪ In the newest table-copy, delete all fields except the new PK, the new control fields, AND the Business Key
     ▪ Rename the table as “Hub_[Enter Entity Name Here]”
     ▪ Remove ‘Load_DTS’ from the Primary Key
     ▪ Add a unique constraint to the Business Key
     ▪ Rename each corresponding table as “Sat_[Enter Entity Name Here_&Something]”
     ▪ Create a defining relationship between the Hub (parent / 1) and each “Sat_[Enter Entity Name Here_&Something]” so that the child table’s FK is also part of its PK
     ▪ Once all entity tables are converted into Hub-Satellite sets, start on the mere-Association tables
     ▪ Still in the new submodel, repeat the above steps to add control fields
     ▪ Add a new “Link_[Assoc_Name]_SQN” (or _HashID)
     ▪ As above, set the PK as …SQN + Load_DTS
     ▪ Rename the table to “Sat_Link_[Enter Assoc. Name Here]”
     ▪ Create another copy of the table, and rename it as “Link_[Enter Assoc. Name Here]”
     ▪ Follow the same remaining steps as with Hubs, except that no Business Key remains in the Link
     ▪ Create a defining relationship from the Link (child) to its directly related Hubs (parents), so that Hub_[ParentHub]_SQN is included in the Link
     ▪ Create a Unique Key on the composite of the Hub_ParentHub_SQN fields
     ▪ Create a defining relationship from the Link (parent) to the LinkSat (child)
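
The steps above allow either a sequence number (_SQN) or a hash (_HashId) as the key. The slide does not say how a hash would be computed, so the following is only an assumed convention: normalize the business key (trim, upper-case), join multiple keys with a delimiter, and hash with SHA-256. The function name hash_id and the sample keys are hypothetical.

```python
import hashlib

def hash_id(*business_keys, delimiter="|"):
    """Derive a deterministic key from one or more business keys.

    Assumed convention: trim and upper-case each key so that cosmetic
    differences between source systems do not produce different keys,
    then join with a delimiter so ("AB", "C") and ("A", "BC") differ.
    """
    normalized = delimiter.join(str(key).strip().upper() for key in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hub key: derived from a single business key.
hub_customer_hash_id = hash_id("C-100")

# Link key: derived from the business keys of the parent Hubs it associates.
link_customer_product_hash_id = hash_id("C-100", "P-7")
```

A practical difference from a sequence number is that a hash can be computed independently by each load process, without first looking up the parent Hub row.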
  16. When all Hubs, Links, and Satellites are done, our example looks like this…
  17. At this time, in the 11th hour prior to our release, a new requirement is announced
     o With a truly elegant display of back-pedaling and dissembling by our primary business stakeholder, standing alongside the organization’s new data scientist.
     o Remember that ‘not to be integrated’ NoSQL example? Well, it does need to integrate after all, and, oops, before the release.
     o For what it’s worth, the data scientist assures us that, with his astonishing coding skills, he neither needs nor wants a data presentation layer or SVOT.
  18. Your team huddles privately afterwards…
     ▪ Amid the grumbling, the PM politely asks, “How long will this take to design and load?”
     ▪ You smile & answer: 1 – 2 days. An hour later, you show these model additions…
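
The slide's actual model additions are not captured in this text. Purely as an illustration of why such an addition can be small, here is a hedged Python sketch of one way a NoSQL document could be aligned to an existing Hub by its business key, with the raw document kept in a new Satellite. All names (Hub_Customer, Sat_Customer_NoSQL, customer_number, load_document) and the choice to store the payload as JSON text are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Hub_Customer (                 -- assumed to exist already
    Hub_Customer_SQN INTEGER PRIMARY KEY,
    Customer_Number  TEXT NOT NULL UNIQUE   -- business key
);
CREATE TABLE Sat_Customer_NoSQL (           -- the only new table in this sketch
    Hub_Customer_SQN INTEGER NOT NULL REFERENCES Hub_Customer (Hub_Customer_SQN),
    DW_Load_DTS      TEXT NOT NULL,
    Document_JSON    TEXT NOT NULL,         -- whole document, untransformed
    PRIMARY KEY (Hub_Customer_SQN, DW_Load_DTS)
);
""")

def load_document(conn, doc, load_dts, key_field="customer_number"):
    """Attach a NoSQL document to the Hub row that shares its business key."""
    business_key = doc[key_field]
    # Register the business key only if the Hub has not seen it before.
    conn.execute(
        "INSERT OR IGNORE INTO Hub_Customer (Customer_Number) VALUES (?)",
        (business_key,),
    )
    (hub_sqn,) = conn.execute(
        "SELECT Hub_Customer_SQN FROM Hub_Customer WHERE Customer_Number = ?",
        (business_key,),
    ).fetchone()
    conn.execute(
        "INSERT INTO Sat_Customer_NoSQL VALUES (?, ?, ?)",
        (hub_sqn, load_dts, json.dumps(doc, sort_keys=True)),
    )
    return hub_sqn

# A customer already loaded from the RDBMS source.
conn.execute("INSERT INTO Hub_Customer VALUES (1, 'C-100')")

known = load_document(
    conn, {"customer_number": "C-100", "clicks": [3, 5]}, "2016-04-02")
new = load_document(
    conn, {"customer_number": "C-200", "clicks": []}, "2016-04-02")
hub_count = conn.execute("SELECT COUNT(*) FROM Hub_Customer").fetchone()[0]
```

In this sketch the existing Hub and its Satellites are untouched: the NoSQL data arrives as one additional Satellite hanging off the same business key.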
  19. Questions:
     ▪ Does Lean Data Warehouse (Data Vault / Hyper Normal) extend to complex data models with many source systems?
  20. DecisionLab.Net
     Data Warehouse / Business Intelligence envisioning, implementation, oversight, and assessment
     This slide deck available now at…
     Daniel Upton, Carlsbad, CA
     blog:
     phone 760.525.3268