The Application of Data Vault to DW2.0
© Dan Linstedt, 2011-2012. All rights reserved.
A bit about me…
  • Author, Inventor, Speaker – and part-time photographer…
  • 25+ years in the IT industry
  • Worked in DoD, US Gov’t, Fortune 50, and so on…
  • Find out more about the Data Vault:
      http://www.youtube.com/LearnDataVault
      http://LearnDataVault.com
  • Full profile on http://www.LinkedIn.com/dlinstedt
Agenda
  • Defining the Needs for the Data Vault
      • DW2.0 Architecture
      • DW2.0 Drivers for Data Modeling
      • Divergence of Data Models over Time
  • Data Vault in DW2.0
      • Defining the Data Vault
      • What does one look like?
      • Modeling in DW2.0
      • Applying Data Vault to Global DW2.0
      • Applying Data Vault to Time-Value DW2.0
      • Compliance in DW2.0
      • Applying Data Vault to System of Record
  • The Paradox of DW2.0
      • Volume, Latency, Complexity, Normalization and Transformation ability
10/5/2011. Do Not Duplicate Without Written Permission.
DW2.0 Architecture
[Diagram: DW2.0 architecture. Labels: Enterprise Service Bus; ESB Connectivity (EAI, EII, ETL / ELT, Web Services); Cube Processing; Temporal Indexing; Semantic Management; Active Data Mining; Transformation; Active Cleansing; Unstructured Data (Email, Plain Text, Word Docs, Images); Metadata; Interactive; Tactical; Integrated; Strategic; ESB Management (Text, Email, Spread Sheets, Transaction, Structured Information); Near-Line; Extended; Archival; Historical; Enterprise Data Warehouse.]
Data models must be consistently applied throughout all layers.
DW2.0 Drivers for Data Modeling
[Diagram: the Data Model sits between Technical Drivers and Business Drivers. Labels: Flexibility, Compliance, Volume, Frequency, Understandability, Granularity.]
  • Data models are one of the main integration points between technical and business drivers.
  • Business keys drive understandability and granularity.
  • Normalization drives flexibility and frequency of load.
  • Raw data sets in the EDW/ADW drive compliance and volume.
Divergence of Data Models over Time
  • Data models (both logical and physical) have diverged from business drivers and direction over time.
  • The data models have been driven towards physical improvements instead of towards business improvements.
  • The Data Vault architecture drives data modeling back to the business side of the house.
Agenda
Image is from What The Bleep Do We Know?
Defining the Data Vault
The Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
Source: Defining the Data Vault, TDAN.com article
What Does One Look Like?
[Diagram: an example Data Vault. Labels: Customer, Account, and Invoice ID; a Link that records a history of the interaction; satellites (Sat) grouped as Customer Information, Account Information, and Invoice / Billing Information; F(x) markers. Elements: Hub, Link, Satellite.]
The impact of linking disparate systems together is inside the shaded area.
Modeling in DW2.0
Bill says:
  • DW2.0 must be brought down to a very finite level of detail.
  • The starting point for DW2.0 is the modeling process.
  • The data model applies to the integrated sector, the near-line sector, and the archival sector.
  • The way that data warehouses are built is in an incremental manner.
The Data Vault specializes in:
  • Providing finite grain at the lowest level possible.
  • Mapping business process models to data models.
  • Existing in all sectors simultaneously without changes.
  • Flexibility and managing change so that impacts are not a mile wide and 10 miles deep.
Elements in a Data Vault
  • Hub: a unique list of business keys, tracked by the first time the warehouse saw them appear.
  • Link: relationships between business keys, also representing a grain shift or a hierarchical roll-up.
  • Satellite: data over time, granular and descriptive about the business key. Also set up according to type of information and rate of change.
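The three elements can be sketched as plain data structures. This is only an illustrative sketch, not the author's physical design: the class fields, the `record_source` column, and the `register_key` helper are assumptions made for the example, and the surrogate keys a real implementation would likely use are left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One row per unique business key."""
    business_key: str
    first_seen: datetime   # first time the warehouse saw this key
    record_source: str     # hypothetical audit column


@dataclass(frozen=True)
class Link:
    """A relationship between two or more business keys."""
    business_keys: tuple   # keys of the hubs being related
    first_seen: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive attributes for one business key at one point in time."""
    business_key: str
    load_date: datetime
    attributes: tuple      # (name, value) pairs, kept immutable


def register_key(hubs: dict, key: str, seen: datetime, source: str) -> Hub:
    """Add a business key to the hub only the first time it appears."""
    if key not in hubs:
        hubs[key] = Hub(key, seen, source)
    return hubs[key]


hubs = {}
register_key(hubs, "CUST-001", datetime(2011, 10, 5), "crm")
# A later sighting does not overwrite the original first_seen timestamp.
register_key(hubs, "CUST-001", datetime(2011, 11, 1), "billing")
```

The hub keeps only the first sighting of each key, which is what "tracked by the first time the warehouse saw them appear" suggests; everything descriptive would live in satellites.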
Applying the Data Vault to Global DW2.0
[Diagram: one model of Hubs, Links, and Satellites spanning four regions. Labels: Base EDW Created in Corporate; Manufacturing EDW in China; Planning in Brazil; Financials in USA.]
Applying the Data Vault to Time-Value DW2.0
[Figure: Satellite Data Over Time, showing Row 1 through Row 4.]
Satellite entities in the Data Vault house data over time. They are split by type of information and rate of change. This is an example set of data for a customer name satellite.
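Since the rows of the figure did not survive, here is a minimal sketch of how a customer name satellite might hold data over time. The sample rows, the tuple layout, and both helper functions are hypothetical; storing a row only when the value changes is an assumption of this sketch.

```python
from datetime import datetime

# Hypothetical sample rows: (customer business key, load date, name).
customer_name_sat = []


def record_name(key, load_date, name):
    """Append a new row only when the name differs from the latest one."""
    history = [r for r in customer_name_sat if r[0] == key]
    if history and max(history, key=lambda r: r[1])[2] == name:
        return False          # nothing changed, nothing stored
    customer_name_sat.append((key, load_date, name))
    return True


def name_as_of(key, when):
    """Return the name that was current for the key at the given time."""
    rows = [r for r in customer_name_sat if r[0] == key and r[1] <= when]
    return max(rows, key=lambda r: r[1])[2] if rows else None


record_name("CUST-001", datetime(2011, 1, 1), "Jane Smith")
record_name("CUST-001", datetime(2011, 3, 1), "Jane Smith")   # no change
record_name("CUST-001", datetime(2011, 6, 1), "Jane Jones")
```

Because earlier rows are never overwritten, `name_as_of` can answer what the warehouse knew at any past date.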

Data Vault and DW2.0

  • 26.
    Batch and Real-Time Data Arrival
    All inserts, all the time.
    [Diagram: an arriving transaction (Transaction ID, Date Stamp, Customer, Account #, Amount, Type) alongside a batch load (Customer Info, Acct Data) with a 3, 6 or 12 hr load window. Targets: Hub Customer, Hub Acct, Link Transaction, Sat Transaction, Sat Customer, Sat Acct.]
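The "all inserts, all the time" idea can be sketched as a loader that only ever adds rows. The in-memory tables and the `load_transaction` function are hypothetical stand-ins for the hub, link, and satellite tables named on the slide.

```python
from datetime import datetime

hub_customer, hub_account = {}, {}   # business key -> first-seen timestamp
link_transaction = {}                # transaction id -> (customer, account)
sat_transaction = []                 # append-only descriptive rows


def load_transaction(txn, loaded_at):
    """Insert one transaction; existing rows are never updated."""
    # Hubs: insert the key only if it has not been seen before, so a
    # transaction can arrive ahead of the batch-loaded customer details.
    hub_customer.setdefault(txn["customer"], loaded_at)
    hub_account.setdefault(txn["account"], loaded_at)
    # Link: relate the two business keys under the transaction id.
    link_transaction.setdefault(txn["id"], (txn["customer"], txn["account"]))
    # Satellite: append the descriptive details as raw data.
    sat_transaction.append(
        (txn["id"], loaded_at, txn["date_stamp"], txn["amount"], txn["type"])
    )


load_transaction(
    {"id": "T1", "date_stamp": datetime(2011, 10, 5, 9, 30),
     "customer": "CUST-001", "account": "ACCT-9", "amount": 25.0,
     "type": "debit"},
    loaded_at=datetime(2011, 10, 5, 9, 31),
)
```

Nothing in the loader waits for the customer or account details to exist first; the later batch load would add its own satellite rows against the same business keys.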
  • 27.
    Star Schema Real-Time Data Issues
    Updates are REQUIRED!
    [Diagram: an arriving transaction (Transaction ID, Date Stamp, Customer, Account #, Amount, Type) alongside a batch load (Customer Info, Acct Data) with a 3, 6 or 12 hr load window. Targets: Dimension Customer, Fact Transaction, Dimension Account.]
    Cleansing and quality must occur before the data can reach the target tables, and they introduce unwanted latency!
  • 28.
    Compliance in DW2.0
    [Diagram: direction of information flow from Source Systems through the EDW / ADW Data Vault to Data Marts and Data Delivery. Labels: Changes to Source Information; Raw Integration; Business Rules; True Marts; Error Mart; Quality; Master Data (Operational); User or Auditor; Continuous Data Improvement.]
      • Raw detail = auditable
      • Loads in real time or in batch
      • Integrated by business key
      • Flexible, allows business changes (with little to no impact)
      • No delay in loading data
      • Data type conformity
      • Semantic integration
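One way to read the flow above is that raw rows are stored as received and business rules run only on the way out to the marts. The sketch below assumes that reading; the sample rows, the numeric-amount rule, and `build_marts` are all hypothetical.

```python
raw_vault = [                      # hypothetical raw rows, stored as received
    {"customer": "CUST-001", "amount": "25.00"},
    {"customer": "CUST-002", "amount": "n/a"},
]


def build_marts(rows):
    """Apply a business rule downstream of the vault, leaving raw rows intact."""
    data_mart, error_mart = [], []
    for row in rows:
        try:
            cleaned = {**row, "amount": float(row["amount"])}
        except ValueError:
            error_mart.append(row)      # failed the rule, kept for review
        else:
            data_mart.append(cleaned)
    return data_mart, error_mart


data_mart, error_mart = build_marts(raw_vault)
```

The raw rows stay untouched and auditable, while rows that fail the rule land in an error mart instead of being dropped or delayed at load time.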
  • 29.
    Applying the Data Vault to System of Record
    [Diagram: Source Systems, Normalized EDW, and Master Data or Conformed Dimensions, labeled with SOR Definition 1, SOR Definition 2, and SOR Definition 3.]
      • SOR 1: Data capture; data produced by system algorithms.
      • SOR 2: Raw, detailed, integrated data over time, integrated by horizontal (functional) business key. Auditable.
      • SOR 3: Current view of the business: merged, quality cleansed, single copy, single source; feeds operational systems.
  • 30.
    DW2.0 Paradoxes
    DW2.0 incorporates:
      • Unstructured, semi-structured, real-time, and batch data
      • Global views
    All of which drive volumes of data.
      • Volume causes latency in transformation.
      • Volume is directly proportional to transformation complexity.
      • Real-time data arrival is inversely proportional to complexity and volume.
      • Time for “quality, cleansing, and transformation” on the way in to the EDW diminishes as near-real-time is approached, or as massive volumes of batch data are found within a shrinking batch window.
      • Transformation can destroy data auditability and compliance of the EDW / ADW.
  • 31.
    DW2.0 Paradoxes - Imagery
    [Diagram: the tensions within DW2.0. Nodes: DW2.0; Real-Time Transactions; Unstructured Data; Low-Level Grain; Low Latency; Volume; Merging, Quality, Cleansing; Data Model Denormalization; Data Model Normalization & Raw Details; Auditability & Compliance. The connections are labeled Drives, Pushes, Increases, Requires, Fights, Inhibits, and Provides.]
  • 32.
    DW2.0 Paradox Hypothesis
      • As we approach near-real-time, the ability to transform data and “wait” for parent dependencies decreases and data decay rates increase, which can cause data death if the data is not processed in time.
      • Normalization of the data model increases flexibility and scalability.
      • The closer we get to near-real-time, the more normalized the data model in the EDW/ADW must become.
      • In order to process high volumes of batch data extremely fast, the “business transformations” must be removed from the load stream of the EDW.
  • 33.
    Data Vault Volumetrics
    Volumetrics (10% null data)
    Upon initial investigation, the 12-month growth rate for new customers is 197.4 MB per year. Now let’s factor in the deltas.
  • 34.
    Data Vault Growth
    Volumetrics (10% null data), delta growth only
      • Original Dimension: 497.16 MB per year
      • New Data Vault: 317.03 MB per year
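The gap between the two delta growth figures can be worked out directly. The two inputs below are the numbers quoted on the slide; the variable names are only illustrative.

```python
dimension_mb_per_year = 497.16   # original dimension, from the slide
data_vault_mb_per_year = 317.03  # new Data Vault, from the slide

saved_mb = dimension_mb_per_year - data_vault_mb_per_year
saved_pct = 100 * saved_mb / dimension_mb_per_year

print(f"{saved_mb:.2f} MB per year less ({saved_pct:.1f}%)")
```

On these figures the Data Vault grows about 180 MB per year less than the original dimension, roughly 36% lower delta growth.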
  • 35.
    Data Vault vs. Dimension Growth
    How does the extensive growth rate affect queries?
  • 36.
    Summarization
    Business:
      • Lack of a single view of a customer, product, service, etc.
      • Lack of visibility into ALL information across the enterprise.
      • Competition does it better, faster, cheaper.
      • Unable to identify and forecast business trends and their impacts.
      • WHERE’S THE KNOWLEDGE? OR IS IT JUST ALL DATA?
    Technical:
      • Near-Real-Time (Active)
      • Huge Data Volumes
      • Massive Data Dis-Integration
      • Spread-Marts
      • Convergence of Operational and Strategic Questions
      • Duplication of data in the ODS, Warehouse, and Data Marts!
      • Dimension-itis!!
      • ODS Ulcer!
      • Fact Table Granularity
      • JUNK tables, Helper Tables
  • 37.
    Where To Learn More
      • The Technical Modeling Book: http://LearnDataVault.com
      • The Discussion Forums & events: http://LinkedIn.com - Data Vault Discussions
      • Contact me:
          http://DanLinstedt.com - web site
          DanLinstedt@gmail.com - email
      • World wide User Group (Free): http://dvusergroup.com