Introduction To Data Vault - DAMA Oregon 2012


DAMA, Oregon Chapter, 2012 presentation: an introduction to Data Vault modeling. I will cover parts of the methodology and compare and contrast general issues in the EDW space, followed by a brief technical introduction to the Data Vault modeling method.

After the presentation I will provide a LIVE demonstration of the ETL loading layers!

You can find more on-line training at: http://LearnDataVault.com/training

Published in: Business, Technology
1 Comment
  • Good presentation, albeit with some random financials thrown in. I do not agree with the last point on slide 14 as a 'con' for using star schemas to model an EDW: 'Not granular enough information to support real-time data integration'. Star schemas are naturally at the lowest level of granularity unless they were incorrectly defined or purposefully aggregated.

  • You’re not the first, nor will you be the last one to use it. Some of the world’s biggest companies are implementing Data Vaults, from Daimler Motors to Lockheed Martin to the Department of Defense. JPMorgan and Chase used the Data Vault model to merge 3 companies in 90 days!
  • Beginning: 5 advanced ETL developers. By the 1st month, they had 5 advanced and 15 basic/intro. By the 6th month, they had 5 advanced but 50 basic. By the end of the 8th month they went to production with 10 MF sources, and their team size was 12 people (5 advanced, 7 basic for support).
  • Introduction To Data Vault - DAMA Oregon 2012

    1. Introduction: Data Vault Model & Methodology. © Dan Linstedt, 2011-2012, all rights reserved. Prepared for: DAMA Oregon, July 2012
    2. Who’s Using It?
    3. The Experts Say…
        • “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon
        • “The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst
        • “The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
    4. More Notables…
        • “This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner
        • “[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from.” Scott Ambler
    5. Agenda
        • Introduce Yourselves…
        • What is a Data Vault? Where does it come from?
        • Pros & Cons of Data Modeling for EDW
        • Current EDW Issues & Pains
        • Consequences of Implementing the Pains…
        • How do we “Fix” This?
        • Keys to Success
        • When “NOT” to use a Data Vault
        • Ontologies and Data Vault
        • A Working Example
        • Query Performance (PIT & Bridge)
        • Conclusion (break)
        • Live Demo
    6. Introduce Yourselves
        • Your Expectations?
        • Your Questions?
        • Your Background?
        • Areas of Interest?
        • What are the top 3 pains your EDW/BI solution is experiencing?
        • About Me…
            o http://www.LinkedIn.com/dlinstedt
        • Learn More Data Vault on-line at:
            o http://LearnDataVault.com/training
    7. Where did it come from? What is it? Defining the Data Vault Space
    8. Data Warehousing Time Line. The Data Vault Model & Methodology took 10 years of R&D to become consistent, flexible, and scalable.
    9. What IS a Data Vault? (Business Definition)
        • Data Vault Model
            o Detail oriented
            o Historical traceability
            o Uniquely linked set of normalized tables
            o Supports one or more functional areas of business
        • Data Vault Methodology
            o CMMI, Project Plan
            o Risk, Governance, Versioning
            o Peer Reviews, Release Cycles
            o Repeatable, Consistent, Optimized
            o Complete with Best Practices for BI/DW
        • Data Vault Architecture
            o 3 Tier Architecture (for including Batch & Unstructured Data)
            o 2 Tier Architecture (for Real-Time only)
    10. The Data Vault Model
        Elements:
        • Hub = List of Unique Business Keys
        • Link = List of Relationships, Associations
        • Satellites = Descriptive Data
        [Diagram: hubs Customer, Product, and Order, each with satellites (Sat), joined through a Link; callout: "Records a history of the interaction"]
    11. Data Vault Methodology
        Follows: SEI/CMMI Level 5, PMP, Six Sigma, TQM, and Agile elements
        • Level 5: Optimized business processes, repeatable, scalable, fault-tolerant. Automatable (generatable)
        • Level 4: Metrics, Estimates vs Actuals, Function Point Analysis, Identification of broken processes
        • Level 3: Defined Business Processes, Defined Goals, Defined Objectives
        • Level 2: Risk assessments / analysis, managed processes, basic alignment efforts
        • Level 1: Process unpredictable and poorly controlled
    12. Data Vault Architecture
        [Diagram labels: SOA (real-time); Sales (batch); Finance (batch); Contracts; Unstructured Data (Hadoop NoSQL); Staging; EDW (Data Vault); Complex Business Rules; Star Schemas; Error Marts; Report Collections; Enterprise BI Solution]
        FUNDAMENTAL GOALS
        • Repeatable
        • Consistent
        • Fault-tolerant
        • Supports phased release
        • Scalable
        • Auditable
        The business rules are moved closer to the business, improving IT reaction time, reducing cost and minimizing impacts to the enterprise data warehouse (EDW).
    13. Star Schemas, 3NF, Data Vault: Pros & Cons. Defining the Data Vault Space
        • Why NOT use Star Schemas as an EDW?
        • Why NOT use 3NF as an EDW?
        • Why NOT use Data Vault as a Data Delivery Model?
    14. Star Schema Pros/Cons as an EDW
        PROS
        • Good for multi-dimensional analysis
        • Subject oriented answers
        • Excellent for aggregation points
        • Rapid development / deployment
        • Great for some historical storage
        CONS
        • Not cross-business functional
        • Use of junk / helper tables
        • Trouble with VLDW
        • Unable to provide integrated enterprise information
        • Can’t handle ODS or exploration warehouse requirements
        • Trouble with data explosion in near-real-time environments
        • Trouble with updates to type 2 dimension primary keys
        • Trouble with late arriving data in dimensions to support real-time arriving transactions
        • Not granular enough information to support real-time data integration
    15. 3NF Pros/Cons as an EDW
        PROS
        • Many to many linkages
        • Handle lots of information
        • Tightly integrated information
        • Highly structured
        • Conducive to near-real time loads
        • Relatively easy to extend
        CONS
        • Time driven PK issues
        • Parent-child complexities
        • Cascading change impacts
        • Difficult to load
        • Not conducive to BI tools
        • Not conducive to drill-down
        • Difficult to architect for an enterprise
        • Not conducive to spiral/scope controlled implementation
        • Physical design usually doesn’t follow business processes
    16. Data Vault Pros/Cons as an EDW
        PROS
        • Supports near-real time and batch feeds
        • Supports functional business linking
        • Extensible / flexible
        • Provides rapid build / delivery of star schemas
        • Supports VLDB / VLDW
        • Designed for EDW
        • Supports data mining and AI
        • Provides granular detail
        • Incrementally built
        CONS
        • Not conducive to OLAP processing
        • Requires business analysis to be firm
        • Introduces many join operations
    17. Analogy: The Porsche, the SUV and the Big Rig
        • Which would you use to win a race?
        • Which would you use to move a house?
        • Would you adapt the truck and enter a race with Porsches and expect to win?
    18. Current EDW Issues and Pains: Business Rule Processing, Lack of Agility, and Future proofing your new solution
    19. Current EDW Project Issues. This is NOT what you want happening to your project! THE GAP!!
    20. 2 Tier EDW Architecture
        [Diagram labels: Enterprise BI Solution; Sales (batch); Finance; Contracts; Staging; Complex Business Rules; Staging + History (EDW); Complex Business Rules #2; Star Schemas; Conformed Dimensions; Junk Tables; Helper Tables; Factless Facts; +Dependencies]
        • Quality routines
        • Cross-system dependencies
        • Source data filtering
        • In-process data manipulation
        • High risk of incorrect data aggregation
        • Larger system = increased impact
        • Often re-engineered at the SOURCE
        • History can be destroyed (completely re-computed)
    21. #1 Cause of BI Initiative Failure. Let’s take a look at one example…
    22. Re-Engineering Business Rules: Data Flow (Mapping)
        [Diagram labels: Current Sources; Sales; Finance; Customer; Customer Transactions; Customer Purchases; Source Join; ** NEW SYSTEM **]
    23. Federated Star Schema Inhibiting Agility
        [Chart: Effort & Cost (Low to High) against Time (Start; Maintenance Cycle Begins), with Data Mart 1, Data Mart 2, and Data Mart 3]
        Changing and adjusting conformed dimensions causes an exponential rise in the cost curve over time. RESULT: Business builds their own Data Marts!
        The main driver for this is the maintenance costs, and re-engineering of the existing system which occurs for each new “federated/conformed” effort. This increases delivery time, difficulty, and maintenance costs.
    24. What are the ROOT Causes? The root causes of RE-ENGINEERING are:
    25. Consequences of Implementing the Pains… Business rules up-stream of your EDW and Conforming Dimensions to store ALL history
    26. Deformed Dimensions
        • Deformity: The URGE to continue “slamming data” into an existing conformed dimension until it simply cannot sustain any further changes, the result: a deformed dimension and a HUGE re-engineering cost / nightmare.
        Business Wants a Change! Business said: Just add that to the existing Dimension, it will be easy right?
        [Diagram: successive business changes V1, V2, V3, each with a Complex Load; costs shown: 90 days, $125k; 120 days, $200k; 180 days, $275k. Callout: "Re-Engineering the Load Processes EACH TIME!"]
    27. Dimension-itis
        • DimensionItis: Incurable Disease, the symptoms are the creation of new dimensions because the cost and time to conform existing dimensions with new attributes rises beyond the business ability to pay…
        Business Says: Avoid the re-engineering costs, just “copy” the dimensions and create a new one for OUR department…
        What happens when we (IT) give in to this?
        [Diagram: many copies of dimension tables]
    28. Result: Silo Data Junkyards!
        • Business Says: Take the dimension you have, copy it, and change it… This should be cheap, and easy right?
        Business Change To Modify Existing Star = 180 days, $275k
        Departments shown: SALES, FINANCE, MARKETING. Their callouts:
        • We built our own because IT costs too much…
        • We built our own because IT took too long…
        • We built our own because we needed customized dimension data…
        [Diagram: the First Star and departmental copies of the customer dimension (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Type) attached to Fact_ABC, Fact_DEF, Fact_PDQ, and Fact_MYFACT]
    29. Accountability In Question?
        Corporate Fraud Accountability: Title XI consists of seven sections. Section 1101 recommends a name for this title as “Corporate Fraud Accountability Act of 2002”. It identifies corporate fraud and records tampering as criminal offenses and joins those offenses to specific penalties. It also revises sentencing guidelines and strengthens their penalties. This enables the SEC to temporarily freeze large or unusual payments.
        [Diagram labels: Source (three); Staging; Business Rules Change Data!; HR Mart 1; Sales Mart 2; Finance Mart 3]
        Are changes to data ON THE WAY IN to the EDW equivalent to records tampering?
    30. How do we “fix” this? Answer: Move the business rules downstream, AND no longer be forced to conform dimensions.
    31. It’s Not Just a Data Model…
    32. Move the Business Rules Downstream
        • No “Conforming” of Dimensions on the way in to the EDW
        • Hold on… We do distinguish between HARD and SOFT business rules…
    33. Hard & Soft Business Rules
        Hard Business Rules
        • Data Domain Alignment (Data Type Matching)
        • Normalization (where necessary)
        • System Column Computation
        Soft Business Rules
        • Any requirement the business user states that, when applied, CHANGES the data or CHANGES the meaning of the data (the grain or interpretation)
        • Simple example that will knock the socks off your feet!
    34. Progressive Agility and Responsiveness of IT
        [Chart: Effort & Cost (Low to High) against Time (Start; Maintenance Cycle Begins), with stages Initial DV Build Out, Foundational Base Built, and New Functional Areas Added]
        Re-Engineering does NOT occur with a Data Vault Model. This keeps costs down, and maintenance easy. It also reduces complexity of the existing architecture.
    35. NO Re-Engineering
        [Diagram labels: Current Sources; Sales; Finance; Customer Transactions; Customer Purchases; ** NEW SYSTEM **; Stage Copy (three); Data Vault; Hub Customer; Hub Acct; Hub Product; Link Transaction]
        NO IMPACT!!! NO RE-ENGINEERING!
    36. Keys to Success: Bringing the Data Vault to Your Project
    37. Key: Flexibility
        Adding new components to the EDW has NEAR ZERO impact to:
        • Existing Loading Processes
        • Existing Data Model
        • Existing Reporting & BI Functions
        • Existing Source Systems
        • Existing Star Schemas and Data Marts
    38. Case In Point: The flexibility of the Data Vault Model allowed them to merge 3 companies in 90 days: that is ALL systems, ALL DATA!
    39. Key: Scalability in Architecture
        Scaling is easy; it’s based on the following principles:
        • Hub and spoke design
        • MPP Shared-Nothing Architecture
        • Scale Free Networks
    40. Case In Point: Scalability produced a Data Vault model that scaled to 3 petabytes in size, and is still growing today!
    41. Key: Scalability in Team Size
        You should be able to SCALE your TEAM as well! With the Data Vault methodology, you can scale your team when desired, at different points in the project!
    42. Case In Point (Dutch Tax Authority): Scalability let them add ETL developers for each new source system, and reassign them once that system was completely loaded into the Data Vault.
    43. Key: Productivity
        Increasing productivity requires a reduction in complexity. The Data Vault Model simplifies all of the following:
        • ETL Loading Routines
        • Real-Time Ingestion of Data
        • Data Modeling for the EDW
        • Enhancing and Adapting for Change to the Model
        • Monitoring, managing and optimizing processes
    44. Case In Point: The result of productivity was that 2 people in 2 weeks merged 3 systems and built a full Data Vault EDW, 5 star schemas and 3 reports. These individuals generated:
        • 90% of the ETL code for moving the data set
        • 100% of the Staging Data Model
        • 75% of the finished EDW data model
        • 75% of the star schema data model
    45. The Competing Bid?
        The competition bid this with 15 people and 3 months to completion, at a cost of $250k! (They bid a very complex system.) Our total cost? $30k and 2 weeks!
    46. Results? Changing the direction of the river takes less effort than stopping the flow of water.
    47. (no transcript text)
    48. When NOT to use the Data Vault: A review of some reasons why not to use a Data Vault Model
    49. When NOT to Use the Data Vault
        • You have:
            o a small set of point solution requirements
            o a very short time-frame for delivery
            o to use the data one time, then throw it away
            o a single source system, single source application
            o a single business analyst in the entire company
        • You do NOT have:
            o audit requirements forcing you to keep history
            o multiple data center consolidation efforts
            o near-real-time to worry about
            o massive batch data to integrate
            o external data feeds outside your control
            o requirements to do trend analysis of all your data
            o pain that forces you to reengineer every time you ask for a change to your current data warehousing systems
    50. Ontologies & Data Vault: Hub, Link, Satellite - Definitions
    51. Business Keys = Ontology
        [Diagram keys: Firm Name; Drug Listing; Product Number; Dose Form Code; NDA Application #; Drug Label Code; Patent Number; Patent Use Code]
        Business Keys should be arranged in an ontology in order to learn the dependencies of the data set.
        NOTE: Different ontologies represent different business views of the data!
    52. Associations = Ontological Hooks
        [Diagram keys: Firm Name; Drug Listing; Product Number; NDA Application #]
        • Firms Generate Product Listings
        • Firms Manufacture Products
        • Listings for Products are in NDA Applications
        Business Keys are associated by many linking factors; these links comprise the associations in the hierarchy.
    53. Descriptors = Context
        [Diagram keys and associations as on the previous slide, with descriptors attached: Firm Locations; Listing Formulation; Product Ingredients; Start & End of manufacturing]
        Descriptors provide the context at a specific point in time; they are the warehousing portion of the Data Vault.
    54. A Working Example: National Drug Codes + Orange Book of Drug Patent Applications
        • http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm
        • http://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm
    55. Hub Table Structures
        • SQN = Sequence (insertion order)
        • LDTS = Load Date (when the Warehouse first sees the data)
        • RSRC = Record Source (System + App where the data ORIGINATED)
    56. Link Table Structures
        Note: A Link is really no different than a factless fact!
    57. Satellite Table Structures
        • SQN = Sequence (parent identity number)
        • LDTS = Load Date (when the Warehouse first sees the data)
        • LEDTS = End of lifecycle for superseded record
        • RSRC = Record Source (System + App where the data ORIGINATED)
    58. In Review…
        • Data Vault is…
            o A Data Warehouse Model & Methodology
            o Hub and Spoke Design
            o Simple, Easy, Repeatable Structures
            o Comprised of Standards, Rules & Procedures
            o Made up of Ontological Metadata
            o AUTOMATABLE!!!
        • Hubs = Business Keys
        • Links = Associations / Transactions
        • Satellites = Descriptors
    59. Why do we build Links this way?
    60. History Teaches Us…
        The EDW is designed to handle TODAY’S relationship; as soon as history is loaded, it breaks the model!
        [Diagram: the Portfolio-to-Customer relationship at three points in time]
        • Today: Portfolio (1) to Customer (M)
        • 5 years from now: Portfolio (M) to Customer (M)
        • 10 years ago: Portfolio (M) to Customer (1)
        This situation forces re-engineering of the model, load routines, and queries!
    61. History Teaches Us…
        [Diagram: the same three Portfolio-to-Customer relationships (Today 1:M, 5 years from now M:M, 10 years ago M:1), all carried by Hub Portfolio, LNK Cust-Port, and Hub Customer]
        This design is flexible, handles past, present, and future relationship changes with NO RE-ENGINEERING!
    62. Applying the Data Vault to Global DW
        [Diagram: one connected model of Hubs, Links, and Sats in regions labeled Manufacturing EDW in China; Planning in Brazil; Base EDW Created in Corporate; Financials in USA]
    63. Query Performance: Point-in-time and Bridge Tables, overcoming query issues
    64. PIT Table Architecture
        Satellite: Point In Time
        • Primary Key: PARENT SEQUENCE, LOAD DATE
        • Columns: {Satellite 1 Load Date}, {Satellite 2 Load Date}, {Satellite 3 Load Date}, {…}, {Satellite N Load Date}
        [Diagram labels: Hub Order; Hub Customer; Hub Product; Link Line Item; Satellite Line Item; Sat 1 to Sat 4 on the hubs; PIT Sat]
    65. PIT Table Example

        SAT_CUST_CONTACT_NAME
        SQN  LOAD_DTS    NAME
        1    10-14-2000  Dan L
        1    11-01-2000  Dan Linedt
        1    12-31-2000  Dan Linstedt

        SAT_CUST_CONTACT_CELL
        SQN  LOAD_DTS    CELL
        1    10-14-2000  999-555-1212
        1    10-15-2000  999-111-1234
        1    10-16-2000  999-252-2834
        1    10-17-2000  999.257-2837
        1    10-18-2000  999-273-5555

        SAT_CUST_CONTACT_ADDR
        SQN  LOAD_DTS    ADDR
        1    08-01-2000  26 Prospect
        1    09-29-2000  26 Prosp St.
        1    12-17-2000  28 November
        1    01-01-2001  26 Prospect St

        PIT table (LOAD_DTS is the Snapshot Date)
        SQN  LOAD_DTS    SAT_NAME_LDTS  SAT_CELL_LDTS  SAT_ADDR_LDTS
        1    08-01-2000  NULL           NULL           08-01-2000
        1    09-01-2000  NULL           NULL           08-01-2000
        1    10-01-2000  NULL           NULL           09-29-2000
        1    11-01-2000  11-01-2000     10-18-2000     09-29-2000
        1    12-01-2000  11-01-2000     10-18-2000     09-29-2000
        1    01-01-2001  12-31-2000     10-18-2000     01-01-2001

    66. Bridge Table Architecture
        Satellite: Bridge
        • Primary Key: UNIQUE SEQUENCE, LOAD DATE
        • Columns: {Hub 1 Sequence #}, {Hub 2 Sequence #}, {Hub 3 Sequence #}, {Link 1 Sequence #}, {Link 2 Sequence #}, {…}, {Link N Sequence #}, {Hub 1 Business Key}, {Hub 2 Business Key}, {…}, {Hub N Business Key}
        [Diagram labels: Hub Seller; Hub Product; Hub Parts; two Links; Sat 1 to Sat 4; Bridge]
    67. Bridge Table Data Example

        Bridge Table: Seller by Product by Part (LOAD_DTS is the Snapshot Date)
        SQN  LOAD_DTS    SELL_SQN  SELL_ID  PROD_SQN  PROD_NUM    PART_SQN  PART_NUM
        1    08-01-2000  15        NY*1     2756      ABC-123-9K  525       JK*2*4
        2    09-01-2000  16        CO*24    2654      DEF-847-0L  324       MN*5-2
        3    10-01-2000  16        CO*24    82374     PPA-252-2A  9938      DD*2*3
        4    11-01-2000  24        AZ*25    25222     UIF-525-88  7         UF*9*0
        5    12-01-2000  99        NM*5     81        DAN-347-7F  16        KI*9-2
        6    01-01-2001  99        NM*5     81        DAN-347-7F  24        DL*0-5

    68. Conclusions
    69. Where To Learn More
        • The Technical Modeling Book: http://LearnDataVault.com/
        • On-Line Training direct from me: http://LearnDataVault.com/training
        • The Discussion Forums & events: http://LinkedIn.com (Data Vault Discussions)
        • Contact me: http://DanLinstedt.com (web site), DanLinstedt@gmail.com (email)
    70. LIVE DEMONSTRATION: Physical Demonstration, Loading Processes and Execution
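
The Hub, Link, and Satellite structures described on slides 55 to 57, and the many-to-many Link argued for on slide 61, can be sketched roughly as follows. This is a minimal illustration rather than the presenter's implementation: the table names, the business-key column bk, the sample keys and names, and the load functions are all assumptions made for the example. Only the meanings of SQN, LDTS, LEDTS, and RSRC are taken from the slides. It uses Python's built-in sqlite3 module so that it runs without a warehouse platform.

```python
import sqlite3

# Table names, the business-key column `bk`, and the sample data are hypothetical.
# SQN, LDTS, LEDTS, and RSRC follow the column meanings given on the slides.
DDL = """
CREATE TABLE hub_customer  (sqn INTEGER PRIMARY KEY, bk TEXT NOT NULL UNIQUE,
                            ldts TEXT NOT NULL, rsrc TEXT NOT NULL);
CREATE TABLE hub_portfolio (sqn INTEGER PRIMARY KEY, bk TEXT NOT NULL UNIQUE,
                            ldts TEXT NOT NULL, rsrc TEXT NOT NULL);
CREATE TABLE lnk_cust_port (sqn INTEGER PRIMARY KEY,
                            cust_sqn INTEGER NOT NULL REFERENCES hub_customer(sqn),
                            port_sqn INTEGER NOT NULL REFERENCES hub_portfolio(sqn),
                            ldts TEXT NOT NULL, rsrc TEXT NOT NULL,
                            UNIQUE (cust_sqn, port_sqn));
CREATE TABLE sat_customer_name (sqn INTEGER NOT NULL REFERENCES hub_customer(sqn),
                            ldts TEXT NOT NULL, ledts TEXT, rsrc TEXT NOT NULL,
                            name TEXT NOT NULL, PRIMARY KEY (sqn, ldts));
"""


def load_hub(conn, table, key, ldts, rsrc):
    """Insert the business key only if it is new; return its sequence number."""
    conn.execute(
        f"INSERT OR IGNORE INTO {table} (bk, ldts, rsrc) VALUES (?, ?, ?)",
        (key, ldts, rsrc),
    )
    return conn.execute(f"SELECT sqn FROM {table} WHERE bk = ?", (key,)).fetchone()[0]


def load_link(conn, cust_sqn, port_sqn, ldts, rsrc):
    """Record the association once; the link itself allows many-to-many."""
    conn.execute(
        "INSERT OR IGNORE INTO lnk_cust_port (cust_sqn, port_sqn, ldts, rsrc) "
        "VALUES (?, ?, ?, ?)",
        (cust_sqn, port_sqn, ldts, rsrc),
    )


def load_satellite(conn, sqn, name, ldts, rsrc):
    """Add a satellite row only when the descriptor changed, end-dating the old row."""
    current = conn.execute(
        "SELECT name FROM sat_customer_name WHERE sqn = ? AND ledts IS NULL", (sqn,)
    ).fetchone()
    if current is not None and current[0] == name:
        return  # nothing changed, so no new history row
    conn.execute(
        "UPDATE sat_customer_name SET ledts = ? WHERE sqn = ? AND ledts IS NULL",
        (ldts, sqn),
    )
    conn.execute(
        "INSERT INTO sat_customer_name (sqn, ldts, ledts, rsrc, name) "
        "VALUES (?, ?, NULL, ?, ?)",
        (sqn, ldts, rsrc, name),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    feed = [  # (customer key, customer name, portfolio key, load date)
        ("C1", "Dan L", "P1", "2000-10-14"),
        ("C2", "Ann B", "P1", "2000-10-14"),
        ("C1", "Dan Linstedt", "P2", "2000-12-31"),
        ("C1", "Dan Linstedt", "P2", "2001-01-01"),  # repeat delivery adds nothing
    ]
    for cust, name, port, ldts in feed:
        cust_sqn = load_hub(conn, "hub_customer", cust, ldts, "SALES")
        port_sqn = load_hub(conn, "hub_portfolio", port, ldts, "SALES")
        load_link(conn, cust_sqn, port_sqn, ldts, "SALES")
        load_satellite(conn, cust_sqn, name, ldts, "SALES")
    return conn
```

Calling demo() loads two customers and two portfolios. Customer C1 ends up linked to both P1 and P2 while P1 is shared with C2, and none of that required a change to the hub tables. Swapping the relationship from one-to-many to many-to-many is a matter of data in the link, not of structure, which is the point slide 61 makes.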

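The PIT table on slide 65 appears to follow one rule: for each snapshot date, record the most recent load date of each satellite that falls on or before that snapshot, or NULL when the satellite has no row yet. That rule is inferred from the slide's data, not stated on it. The sketch below applies it to the load dates from the slide, rewritten as ISO dates so they sort correctly as strings. The function names are hypothetical.

```python
from bisect import bisect_right

# Load dates for customer SQN 1, copied from slide 65 and rewritten as ISO dates.
SAT_NAME_LDTS = ["2000-10-14", "2000-11-01", "2000-12-31"]
SAT_CELL_LDTS = ["2000-10-14", "2000-10-15", "2000-10-16", "2000-10-17", "2000-10-18"]
SAT_ADDR_LDTS = ["2000-08-01", "2000-09-29", "2000-12-17", "2001-01-01"]
SNAPSHOTS = ["2000-08-01", "2000-09-01", "2000-10-01",
             "2000-11-01", "2000-12-01", "2001-01-01"]


def latest_on_or_before(load_dates, snapshot):
    """Return the most recent load date <= snapshot, or None if there is none.

    `load_dates` must be sorted ascending (assumed rule, inferred from the slide).
    """
    i = bisect_right(load_dates, snapshot)
    return load_dates[i - 1] if i else None


def build_pit(snapshots, *satellites):
    """One PIT row per snapshot: (snapshot, load date per satellite...)."""
    return [
        (snap, *(latest_on_or_before(sat, snap) for sat in satellites))
        for snap in snapshots
    ]


pit = build_pit(SNAPSHOTS, SAT_NAME_LDTS, SAT_CELL_LDTS, SAT_ADDR_LDTS)
for row in pit:
    print(row)
```

With the PIT row in hand, a query can join each satellite on an exact (SQN, load date) match instead of searching for the latest row per satellite at query time, which is presumably the query issue slide 63 refers to.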