
IRM UK - 2009: DV Modeling And Methodology



This was a presentation I gave at the IRM UK conference in November 2009. It covers the steps you should take to build your Data Vault, and gives an overview of why re-engineering creeps into your existing silo solutions.

Published in: Technology, Business


  1. 1. 10/6/2011<br /><br />1<br />
  2. 2. Data Vault Modeling Methodology<br />A Primer…<br />© Dan Linstedt 2009-2012<br />All Rights Reserved<br /><br />
  3. 3. A bit about me…<br />3<br />Author, Inventor, Speaker – and part time photographer…<br />25+ years in the IT industry<br />Worked in DoD, US Gov’t, Fortune 50, and so on…<br />Find out more about the Data Vault:<br /><br /><br />Full profile on<br /><br />
  4. 4. What IS a Data Vault? (Business Definition)<br />Data Vault Model<br />Detail oriented<br />Historical traceability<br />Uniquely linked set of normalized tables<br />Supports one or more functional areas of business<br />10/6/2011<br /><br />4<br /><ul><li>Data Vault Methodology
  5. 5. CMMI Level 5 Project Plan
  6. 6. Risk, Governance, Versioning
  7. 7. Peer Reviews, Release Cycles
  8. 8. Repeatable, Consistent, Optimized
  9. 9. Complete with Best Practices for BI/DW</li></ul>Business Keys<br />Span / Cross<br />Lines of Business<br />Sales<br />Contracts<br />Planning<br />Delivery<br />Finance<br />Operations<br />Procurement<br />Functional Area<br />
  10. 10. What Does One Look Like?<br />10/6/2011<br /><br />5<br />Records a history of the interaction<br />Customer<br />Product<br />Sat<br />Sat<br />Sat<br />Sat<br />Sat<br />Link<br />Customer<br />Product<br />F(x)<br />F(x)<br />F(x)<br />Sat<br />Sat<br />Sat<br />Sat<br />Order<br />F(x)<br />Sat<br />Order<br />Elements:<br /><ul><li>Hub
  11. 11. Link
  12. 12. Satellite</li></ul>Hub = List of Unique Business Keys<br />Link = List of Relationships, Associations<br />Satellites = Descriptive Data<br />
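The three element types above can be sketched as plain records. This is a minimal illustration only, not the schema used in the presentation: the field names (`load_date`, `record_source`, `hub_keys`, `parent_key`, `attributes`) and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One unique business key, e.g. a customer number."""
    business_key: str
    load_date: datetime      # assumed audit column
    record_source: str       # assumed audit column


@dataclass(frozen=True)
class Link:
    """A relationship or association between two or more hub keys."""
    hub_keys: tuple          # business keys of the hubs being related
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive data for one parent (a hub or a link) at one point in time."""
    parent_key: object       # a hub business key, or a link's hub_keys tuple
    load_date: datetime
    attributes: tuple        # (name, value) pairs describing the parent


now = datetime(2009, 11, 1)
customer = Hub("CUST-001", now, "crm")
product = Hub("PROD-042", now, "erp")
order = Link((customer.business_key, product.business_key), now, "orders")
customer_detail = Satellite(customer.business_key, now, (("name", "Acme Ltd"),))
```

Note that the Hub carries no descriptive columns at all: anything that describes the customer lives in a Satellite, and anything that relates the customer to a product lives in a Link.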
  13. 13. Who’s Using It?<br />10/6/2011<br /><br />6<br />
  14. 14. The PAIN!!<br />Issues in Current EDW Projects<br />10/6/2011<br /><br />7<br />
  15. 15. EDW Architecture: Generation 1<br />10/6/2011<br /><br />8<br />Enterprise BI Solution<br />(batch)<br />Sales<br />Staging<br />(EDW)<br />Star<br />Schemas<br />Complex <br />Business <br />Rules<br />Finance<br />Conformed Dimensions<br />Junk Tables<br />Helper Tables<br />Factless Facts<br />Staging + History<br />Contracts<br />Complex Business Rules<br />+Dependencies<br />
  16. 16. Kick-Starting Data Warehousing<br />HR Asks IT to build the FIRST Data Warehouse / Prototype<br />10/6/2011<br /><br />9<br />1.<br />2.<br />IT Says… <br />OK: $125k and 90 days…<br />3.<br />HR Says:<br />Great! Get Started<br />
  17. 17. Everyone’s Happy!<br />IT Delivers. On-Time & In Budget!<br />10/6/2011<br /><br />10<br />4.<br />5.<br />HR Says:<br />Thank-you! We’re Happy!<br />First Star!<br />[Diagram: star schema with fact tables Fact_ABC, Fact_DEF, Fact_PDQ, and Fact_MYFACT joined to customer dimensions (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Type)]<br />
  18. 18. So Where’s the PAIN?<br />10/6/2011<br /><br />11<br />
  19. 19. The PAIN is RIGHT HERE!!<br />Contracts Sees Success, wants the same for their systems.<br />10/6/2011<br /><br />12<br />1.<br />2.<br />IT Says… Ok, but… <br />It won’t be $125k and 90 days…<br />Because we have to “merge it” with HR, it will be $250k and 180 days.<br />3.<br />Contracts Says:<br />Ouch! That’s not reasonable, but we need it, so go ahead…<br />
  20. 20. And HERE….<br />10/6/2011<br /><br />13<br />Finance, Sales, and Marketing want in….<br />IT Says… Ok, but… <br />It won’t be $250k and 90 days… Because we have to “merge it” with HR and Contracts, it will be $350k and 250 days.<br />And this continues….<br />Business Says...<br />“Can’t you just make a copy of the Star Schema, and give me my own for cheaper & less time?”<br />
  21. 21. Silo Building / IT Non-Agility<br />10/6/2011<br /><br />14<br />First Star<br />SALES<br />We built our own because IT costs too much<br />FINANCE<br />We built our own because IT took too long<br />MARKETING<br />We built our own because we need customized dimension data<br />Why is this happening? What’s Causing this Problem?<br />
  22. 22. Root Cause of Pain: Re-Engineering!<br />10/6/2011<br /><br />15<br />IT is forced to Re-Engineer ETL loading code + SQL BI Queries WHENEVER:<br /><ul><li>Table structures change
  23. 23. New systems are introduced</li></ul>1. Adding fields to Dimensions<br /><ul><li>Business Rules Change
  24. 24. (causing ETL Loading to change, and forcing Engineers to RELOAD existing data)</li></ul>[Diagram: star schema with fact tables Fact_ABC, Fact_DEF, Fact_PDQ, and Fact_MYFACT joined to customer dimensions (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Type)]<br />2. Adding fields to Facts<br />3. Adding Dimensions to Facts<br />
  25. 25. Why Re-Engineering?<br />10/6/2011<br /><br />16<br />Adding fields to a conformed <br />dimension….<br />Adding fields to a shared <br />fact….<br />Changing code to match <br />new business rules…<br />Require adding/changing<br />Fields in target tables!<br />Require Re-Engineering!<br />
  26. 26. Other Pains?<br />10/6/2011<br /><br />17<br />Dimension-Itis?<br />IT – Non-Agility?<br />Deformed Dimensions?<br />What about the “data” you don’t see?<br />What about the “BAD” data left in the source systems?<br />
  27. 27. The Solution<br />Go the Data Vault Route!<br />10/6/2011<br /><br />18<br />
  28. 28. EDW Architecture: Generation 2<br />10/6/2011<br /><br />19<br />SOA<br />Enterprise BI Solution<br />Star<br />Schemas<br />(real-time)<br />Sales<br />(batch)<br />DV<br />EDW<br />(batch)<br />Staging<br />Error<br />Marts<br />Finance<br />Contracts<br />Report<br />Collections<br />Business Rules Downstream!<br />(the Lens Filter)<br />
  29. 29. Unstructured Data And Data Vault<br />10/6/2011<br /><br />20<br />Unstructured Data Sets<br />Ontologies/Taxonomies<br />Unstructured <br />Processing Engine<br /><ul><li>Email
  30. 30. Docs
  31. 31. Images
  32. 32. Movies
  33. 33. Sound</li></ul>On-Demand<br />Cubes<br />Joins through LINK Structures<br />Data Vault EDW<br />
  34. 34. IT Agility<br />10/6/2011<br /><br />21<br />RAW<br />“what-is”<br />Star<br />Schemas<br />Complex<br />Business <br />Rules<br />ETL-T<br />Data Vault<br />(EDW)<br />Source<br />Staging<br />Business<br />Driven<br />Star<br />Schemas<br />2. Business Gap Analysis<br /><ul><li>Unknown Time…
  35. 35. Business Requirements
  36. 36. Start new phase</li></ul>1. Fast Load & Fast Integration<br />3. IT Implementation of Business Rules<br />
  37. 37. What are the Facts, Jack?<br />10/6/2011<br /><br />22<br />Generation 1 EDWs tried to provide<br />“One version of the truth”<br />Generation 2 (Data Vaults) provide…<br />“One version of the facts, for each point in time.”<br />
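"One version of the facts, for each point in time" can be read as: Satellite rows are only ever appended, so any past state can be looked up by date. The sketch below shows such an as-of lookup. It is an illustration under assumptions, not code from the presentation: the `history` data, the `city` attribute, and the `as_of` helper are all hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical satellite history for one customer: (load_date, attributes),
# kept sorted by load_date. Rows are appended, never updated in place.
history = [
    (date(2009, 1, 1), {"city": "London"}),
    (date(2009, 6, 1), {"city": "Leeds"}),
    (date(2010, 3, 1), {"city": "York"}),
]


def as_of(rows, when):
    """Return the attributes that were current on `when`, or None if none existed yet."""
    load_dates = [load_date for load_date, _ in rows]
    # bisect_right counts the rows loaded on or before `when`.
    position = bisect_right(load_dates, when)
    return rows[position - 1][1] if position else None
```

Asking for the state on 15 July 2009 returns the Leeds row, even though a later row exists; nothing was overwritten to get there.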
  38. 38. Business Gap Analysis<br />10/6/2011<br /><br />23<br />The Way Business Perceives <br />its business to be running<br />Gap<br />Analysis<br />Operational<br />Reports<br />Gap<br />Analysis<br />Dynamic<br />Cubes<br />(Data Marts)<br />The way the source systems see the business running.<br />
  39. 39. Secured/Protected Information Systems<br />10/6/2011<br /><br />24<br />Non-Classified DV<br />Classified Data Vault<br />Hub<br />Sat<br />Hub<br />Data Copy<br />Link<br />Link<br />Sat<br />Sat<br />Sat<br />Model Copy<br />Sat<br />Hub<br />Hub<br />Link<br />Hub<br />Sat<br />Sat<br />Sat<br />Sat<br />Sat<br />Sat<br />Sat<br />Sat<br />Yellow = New Tables<br /><ul><li>Model changes are absorbed seamlessly into the classified system
  40. 40. Classified world can add all their own structures while maintaining congruence with standard unclassified Data Vault</li></ul>
  Extensibility Factor<br />10/6/2011<br /><br />25<br />New Additions<br />New Code<br />Billed<br />Amounts<br />Product <br />Shipped<br />Dates<br />Product<br />Quantities<br />Existing EDW<br />No Impact!<br />Product<br />Supplier<br />Link<br />Suppliers<br />Products<br />Descriptions<br />Descriptions<br />Address<br />Availability Dates<br />Stock Quantities<br />Stock Quantities<br />Defect Reasons<br />Rating Score<br />
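The "No Impact!" claim on the Extensibility slide can be illustrated with a small experiment: add a new Satellite as a new table and check that the definitions of the existing tables are untouched. This is a sketch using Python's built-in `sqlite3` module; the table and column names (`hub_product`, `sat_product_description`, `sat_product_stock`) are invented for the example and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing model: one hub and one satellite (names are illustrative).
conn.executescript("""
    CREATE TABLE hub_product (product_key TEXT PRIMARY KEY);
    CREATE TABLE sat_product_description (
        product_key TEXT REFERENCES hub_product(product_key),
        load_date   TEXT,
        description TEXT,
        PRIMARY KEY (product_key, load_date)
    );
""")


def table_definitions(connection):
    """Return {table name: CREATE statement} for every table in the database."""
    rows = connection.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return dict(rows)


before = table_definitions(conn)

# New requirement: track stock quantities. Add a new satellite; alter nothing.
conn.execute("""
    CREATE TABLE sat_product_stock (
        product_key TEXT REFERENCES hub_product(product_key),
        load_date   TEXT,
        quantity    INTEGER,
        PRIMARY KEY (product_key, load_date)
    )
""")

after = table_definitions(conn)
unchanged = all(after[name] == sql for name, sql in before.items())
```

Because the new data arrives as a new table, load code and queries written against `hub_product` and `sat_product_description` keep working as they were. Contrast this with adding a column to a conformed dimension, which is the re-engineering case described earlier in the deck.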
  41. 41. Where’s the Solution?<br />10/6/2011<br /><br />26<br />Re-Engineering<br />Handle Changes Wherever… Whenever… with EASE!<br />
  42. 42. The Three vehicles…<br />Pros and Cons of the Modeling Methodologies<br />10/6/2011<br /><br />27<br />
  43. 43. 3rd Normal Form Pros/Cons as an EDW<br />PROS (as 3NF)<br />Many to many linkages<br />Handle lots of information<br />Tightly integrated information<br />Highly structured<br />Conducive to near-real time loads<br />Relatively easy to extend<br />10/6/2011<br /><br />28<br />CONS (as EDW)<br />Time driven PK issues<br />Parent-child complexities<br />Cascading change impacts<br />Difficult to load<br />Not conducive to BI tools<br />Not conducive to drill-down<br />Difficult to architect for an enterprise<br />Not conducive to spiral/scope controlled implementation<br />Physical design usually doesn’t follow business processes<br />
  44. 44. Star Schema Pros/Cons as an EDW<br />PROS (as Data Mart)<br />Good for multi-dimensional analysis<br />Subject oriented answers<br />Excellent for aggregation points<br />Rapid development / deployment<br />Great for some historical storage<br />10/6/2011<br /><br />29<br />CONS (as EDW)<br />Not cross-business functional<br />Use of junk / helper tables<br />Trouble with VLDW<br />Unable to provide integrated enterprise information<br />Can’t handle ODS or exploration warehouse requirements<br />Trouble with data explosion in near-real-time environments<br />Trouble with updates to type 2 dimension primary keys<br />Trouble with late arriving data in dimensions to support real-time arriving transactions<br />Not granular enough information to support real-time data integration<br />
  45. 45. Data Vault Pros/Cons as an EDW<br />PROS (as EDW)<br />Supports near-real time and batch feeds<br />Supports functional business linking<br />Extensible / flexible<br />Provides rapid build / delivery of star schemas<br />Supports VLDB / VLDW<br />Designed for EDW<br />Supports data mining and AI<br />Provides granular detail<br />Incrementally built<br />10/6/2011<br /><br />30<br />CONS (as EDW)<br />Not conducive to OLAP processing<br />Requires business analysis to be firm<br />Introduces many join operations<br />
  46. 46. The Three Vehicles…<br />Which would you use to win a race?<br />Which would you use to move a house?<br />Would you adapt the truck and enter a race with Porsches and expect to win?<br />10/6/2011<br /><br />31<br />
  47. 47. #1 complaint about DV architecture<br />So you want to deal with Joins do you?<br />10/6/2011<br /><br />32<br />
  48. 48. Joins, Everywhere!<br />10/6/2011<br /><br />33<br />Yes, the DV is full of joins but…<br />These are highly normalized tables (thin & Narrow), reducing I/O’s to read large numbers of rows, at high speed, in parallel. Joins occur in RAM instead of on disk. The Optimizer is given a chance to “drop tables” from the join that aren’t necessary.<br />When Parallelism is too much…<br /><ul><li>Not enough CPU or RAM to handle the extra work-load
  49. 49. Not enough rows being queried (the overhead of starting the threads takes longer than an original scan)</li></ul>End Result? The DV Scales to the Petabyte Levels when necessary…<br />
  50. 50. Mathematics Behind the Data Vault Model<br />*** The Data Vault is BACKED by Mathematical Principles***<br />Parallel versus sequential execution models<br />Set Logic<br />I/O Bandwidth & Throughput<br />Compression (for query performance gains)<br />Process Repeatability (tuning & predictability measurements)<br />RAM versus electromagnetic disk (Solid-State Drives are not measured)<br /><br />10/6/2011<br /><br />34<br />
  51. 51. Know when to hold ‘em, know when to fold ‘em<br />When to use DV, and when not…<br />10/6/2011<br /><br />35<br />
  52. 52. The Challenger….<br />10/6/2011<br /><br />36<br />The challenger says:<br /><ul><li>My system works fine, why should I use the Data Vault?
  53. 53. I don’t have volume problems…
  54. 54. I don’t have compliance/auditability problems…
  55. 55. I don’t have real-time problems…
  56. 56. My system produces matching results across lines of business…
  57. 57. I’ve never had to “re-state” the data in the warehouse…
  58. 58. I can still build new marts, and conform dimensions in 30 days or less…
  59. 59. My business doesn’t acquire new systems often (if ever)
  60. 60. My incoming data sets don’t change</li></ul>I Say…<br />That’s wonderful, don’t fix what isn’t broken. Have a nice day. Oh, but call me if you ever run into these problems…<br />
  61. 61. When to Apply the Data Vault<br />10/6/2011<br /><br />37<br />Benefits:<br /><ul><li>Scalability
  62. 62. Auditability
  63. 63. Flexibility
  64. 64. Adaptability</li></ul>Leads To…<br /><ul><li>IT Agility
  65. 65. IT and Business Accountability
  66. 66. Reduction in Spread-Marts
  67. 67. Corporate Asset Development
  68. 68. Money Savings
  69. 69. Risk Mitigation
  70. 70. Successful EDW Implementations</li></ul>
  How to build a data vault<br />In 10 easy steps…<br />10/6/2011<br /><br />38<br />
  71. 71. Step 1<br />10/6/2011<br /><br />39<br />Identify your business processes, followed by your business keys (that are used to identify the data that flows through the business processes)<br />** NOTE: Along the way, document your assumptions, document your reasons for choosing keys, and modeling designs, develop a list of questions to be answered by business users…<br />
  72. 72. Step 2<br />10/6/2011<br /><br />40<br />Identify the issues/problems that might be carried with the identified business keys, annotate the risks, and mitigate each one.<br />
  73. 73. Step 3<br />10/6/2011<br /><br />41<br />Identify the units of work, the associations – LINK tables, where keys combine to form a notion, a concept, and a relationship.<br />
  74. 74. Step 4<br />10/6/2011<br /><br />42<br />Identify the descriptive data that belongs to SINGLE Hub Keys, ensure that the data doesn’t represent or rely on a relationship.<br />
  75. 75. Step 5<br />10/6/2011<br /><br />43<br />Identify the Satellite data that depends on relationships – move it to the appropriate LINK table.<br />HINT: If you “want” to put a Foreign Key in a Satellite, you have a clear sign that the Satellite is in the WRONG place, and needs to be assigned to a LINK table rather than a HUB.<br />
  76. 76. Step 6<br />10/6/2011<br /><br />44<br />Scope the Model Down to a manageable chunk. Implement the first two Hubs, Hub Satellites, and first Link. BUILD IN INCREMENTS!<br />
  77. 77. Step 7<br />10/6/2011<br /><br />45<br />Set up the key generation load routines, set up the staging area, and begin loading data.<br />
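One possible shape for a key generation routine in Step 7 is a lookup that hands out one surrogate key per distinct business key. The sketch below assumes sequence-style integer keys and a trim-and-uppercase normalisation; both are choices made for this example, and `KeyGenerator` is a hypothetical helper, not something named in the presentation.

```python
from itertools import count


class KeyGenerator:
    """Hands out one surrogate key per distinct business key."""

    def __init__(self):
        self._next_id = count(1)
        self._assigned = {}

    def key_for(self, business_key):
        # Normalise so that " cust-1 " and "CUST-1" map to the same hub row.
        # Whether this normalisation is appropriate is a per-key modelling decision.
        normalised = business_key.strip().upper()
        if normalised not in self._assigned:
            self._assigned[normalised] = next(self._next_id)
        return self._assigned[normalised]


customer_keys = KeyGenerator()
staged = ["CUST-1", "CUST-2", " cust-1 "]
surrogates = [customer_keys.key_for(k) for k in staged]
```

The third staged value resolves to the key already issued for `CUST-1`, so repeated loads of the same business key never create a second Hub row.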
  78. 78. Step 8<br />10/6/2011<br /><br />46<br />Review any “truncation” errors, or any data-type conversion problems, fix the staging area, and remove duplicates.<br />
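The Step 8 checks can be sketched as two small passes over staged rows: one that flags values too long for their target column, and one that drops exact duplicates. The row layout, the column limits, and both function names are assumptions for illustration; a real staging area would run checks like these in the database or the ETL tool.

```python
def find_truncation_risks(rows, max_lengths):
    """Return (row_index, column, value) for values longer than the target column."""
    return [
        (index, column, row[column])
        for index, row in enumerate(rows)
        for column, limit in max_lengths.items()
        if len(str(row.get(column, ""))) > limit
    ]


def remove_duplicates(rows):
    """Drop exact duplicate rows, keeping the first occurrence and original order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


staged = [
    {"customer_id": "C1", "city": "London"},
    {"customer_id": "C1", "city": "London"},
    {"customer_id": "C2", "city": "Llanfairpwllgwyngyll"},
]
risks = find_truncation_risks(staged, {"city": 10})
clean = remove_duplicates(staged)
```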
  79. 79. Step 9<br />10/6/2011<br /><br />47<br />Begin loading the Data Vault. Load all Hubs, then all Hub Satellites, then all Links, and finish with all Link Satellites.<br />
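The load order in Step 9 follows the dependencies between the table types: a Satellite needs its Hub, a Link needs its Hubs, and a Link Satellite needs its Link. The sketch below makes that order explicit with four passes over the same staged rows. The in-memory `vault` dictionary, the row layout, and the sample data are assumptions for illustration only.

```python
from datetime import date


def load_vault(staged_rows, load_date):
    """Load staged (customer, product, customer_name, quantity) rows in four passes."""
    vault = {"hubs": set(), "hub_satellites": {}, "links": set(), "link_satellites": {}}

    # Pass 1: Hubs. Every business key is registered before anything refers to it.
    for customer, product, _name, _quantity in staged_rows:
        vault["hubs"].add(("customer", customer))
        vault["hubs"].add(("product", product))

    # Pass 2: Hub Satellites. Descriptive data that depends on a single hub key.
    for customer, _product, name, _quantity in staged_rows:
        vault["hub_satellites"][(customer, load_date)] = {"name": name}

    # Pass 3: Links. Relationships between hub keys that are already loaded.
    for customer, product, _name, _quantity in staged_rows:
        vault["links"].add((customer, product))

    # Pass 4: Link Satellites. Descriptive data that depends on the relationship.
    for customer, product, _name, quantity in staged_rows:
        vault["link_satellites"][((customer, product), load_date)] = {"quantity": quantity}

    return vault


staged = [
    ("CUST-1", "PROD-1", "Acme Ltd", 5),
    ("CUST-1", "PROD-2", "Acme Ltd", 1),
    ("CUST-2", "PROD-1", "Bolt plc", 3),
]
vault = load_vault(staged, date(2009, 11, 2))
```

The quantity is stored on the Link Satellite because it describes the customer-product relationship rather than either key alone, which is the rule given in Steps 4 and 5.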
  80. 80. Step 10<br />10/6/2011<br /><br />48<br />Reconcile the Data Vault to the source system, then build a first data mart from the results. Bring business value FAST!<br />
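A first reconciliation pass for Step 10 can be as simple as comparing the business keys seen in the source with the keys that landed in a Hub. The `reconcile` function and its report fields are hypothetical, and a fuller reconciliation would also compare row counts and attribute values.

```python
def reconcile(source_keys, vault_keys):
    """Compare the business keys in the source with those loaded into a hub."""
    source, vault = set(source_keys), set(vault_keys)
    return {
        "missing_from_vault": sorted(source - vault),
        "unexpected_in_vault": sorted(vault - source),
        "matched": len(source & vault),
    }


report = reconcile(["C1", "C2", "C3", "C3"], ["C1", "C3"])
```

An empty `missing_from_vault` and an empty `unexpected_in_vault` are the signal that the Hub matches the source and a first data mart can be built on it.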
  81. 81. Instructor led lab<br />10/6/2011<br /><br />49<br />
  82. 82. 10 minutes to find the Hubs….<br />10/6/2011<br /><br />50<br />
  83. 83. Possible Hubs From Northwind<br />10/6/2011<br /><br />51<br />
  84. 84. 10 Minutes to find the Links…<br />10/6/2011<br /><br />52<br />
  85. 85. Possible Links From Northwind<br />10/6/2011<br /><br />53<br />
  86. 86. 10 minutes to find the Satellites…<br />10/6/2011<br /><br />54<br />
  87. 87. Possible Satellites From Northwind<br />10/6/2011<br /><br />55<br />
  88. 88. What did we learn?<br />We often deal with more than 1 system at a time… this was a lab with only one model.<br />We didn’t have any business requirements to tell us which questions we might need to answer, but doesn’t that reflect real life?<br />The data set is extremely dirty (you never have that in your systems, right?)<br />Time Zone based data can be a problem<br />Lack of metadata causes integration issues and affects modeling decisions<br />10/6/2011<br /><br />56<br />
  89. 89. The Experts Say…<br />“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” <br />Bill Inmon<br />“The Data Vault is foundationally strong and exceptionally scalable architecture.”<br />Stephen Brobst<br />“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” <br />Doug Laney<br />57<br />
  90. 90. More Notables…<br />“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” <br />Howard Dresner<br />“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from..”<br />Scott Ambler<br />58<br />
  91. 91. Where To Learn More<br />The Technical Modeling Book:<br />The Discussion Forums: & events – Data Vault Discussions<br />Contact me: - web - email<br />World wide User Group (Free)<br />59<br />