The Data Warehouse Lifecycle
Bart Lowe
Decision Source Inc.

Agenda
  • Discuss high-level concepts related to data warehousing

The Data Warehouse Must…
  • Make information easily accessible
  • Present information consistently
  • Be adaptive & resilient to change
  • Be secure
  • Serve as the foundation for decision making

The Business Community Must…
  • Accept and trust the data warehouse if it is to be successful

Data Warehouse Lifecycle
Data Warehouse Components

Source Systems
  • Not optimized for reporting
  • Data schemas optimized for transactions, not queries
  • Difficult to share data
  • Typically do not maintain historical data

Data Staging Area
  • Consists of staging storage & ETL processes
  • This is typically the most difficult and labor-intensive component
  • Data is cleansed & conformed
  • Typically a normalized data schema
  • No direct querying of this component is allowed

Data Presentation Area
  • This is the data that is made available to users & analytical applications
  • Consists of a series of conformed dimensional data marts
  • Each data mart represents a different business process
  • Dimensional modeling emphasizes simplicity & query performance

Business Intelligence Tools
  • These are the tools that are used to query the data
  • In general, only a small subset of users will need true ad hoc query capability
  • 80-90% of users will use a parameterized analytic system

Dimensional Modeling Vocabulary

Fact Table
  • This is the primary table in a dimensional model
  • The measurements of the dimensional model are stored here
  • Each measurement is tracked at the intersection of several dimensions
  • This is the “grain” of the model
  • Most useful facts are additive
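
To make "grain" and "additive" concrete, here is a minimal sketch in Python. The table contents, the column names, and the `total_by` helper are hypothetical and are not taken from the deck. The assumed grain is one row per date, store, and product.

```python
from collections import defaultdict

# Hypothetical fact rows. Grain: one row per (date, store, product).
# Each row carries dimension keys plus additive measurements.
sales_facts = [
    {"date_key": 20240101, "store_key": 1, "product_key": 10, "units": 3, "amount": 30.0},
    {"date_key": 20240101, "store_key": 2, "product_key": 10, "units": 1, "amount": 10.0},
    {"date_key": 20240102, "store_key": 1, "product_key": 11, "units": 2, "amount": 50.0},
    {"date_key": 20240102, "store_key": 1, "product_key": 10, "units": 4, "amount": 40.0},
]

def total_by(facts, dimension_key, measure):
    """Sum an additive measure grouped by a single dimension key."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dimension_key]] += row[measure]
    return dict(totals)

# An additive fact can be summed along any dimension and still be meaningful.
amount_by_store = total_by(sales_facts, "store_key", "amount")
amount_by_date = total_by(sales_facts, "date_key", "amount")
```

Because `amount` is additive, both groupings add up to the same grand total. A semi-additive fact such as an account balance could be summed across stores but not across dates.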

Dimension Table
  • Descriptors of each fact
  • Tend to have many attributes but fewer rows
  • Tend to be used as query constraints
  • The better the attribute descriptions, the better the warehouse
  • Typically highly denormalized

Star Schema
  • This is a fact table joined to a set of dimensions
  • Relates data in a manner that is familiar to business users
  • Symmetrical nature allows for answering many different business questions
  • One dimensional model will exist for each business process. A single data warehouse can have dozens of these models.
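
A star schema can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are assumptions made for illustration only. The two queries have the same shape, which is one way to read the "symmetrical nature" bullet above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, store_name   TEXT, region   TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_store   VALUES (1, 'Downtown', 'East'), (2, 'Airport', 'West');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 40.0), (2, 1, 15.0), (2, 2, 5.0);
""")

# Question 1: sales by region (constrain and group on a store attribute).
by_region = conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.region ORDER BY s.region
""").fetchall()

# Question 2: sales by category (same pattern, different dimension).
by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()
```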
Dimensional Modeling Key Concepts

Store the most atomic data
  • By storing the most detailed data possible, you can ensure that users can drill to the level they need.
  • It's OK to provide aggregate facts as well to improve performance.
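
The relationship between atomic facts and an aggregate can be sketched as follows. The sample rows and the helper names `build_monthly_aggregate` and `drill_down` are hypothetical. The aggregate is derived from the atomic rows, so drilling down is still possible.

```python
from collections import defaultdict
from datetime import date

# Hypothetical atomic facts: (sale date, product, amount).
atomic_sales = [
    (date(2024, 1, 5), "Widget", 20.0),
    (date(2024, 1, 5), "Manual", 5.0),
    (date(2024, 1, 20), "Widget", 30.0),
    (date(2024, 2, 3), "Widget", 10.0),
]

def build_monthly_aggregate(rows):
    """Pre-summarize atomic rows to (year, month, product) for faster queries."""
    aggregate = defaultdict(float)
    for sale_date, product, amount in rows:
        aggregate[(sale_date.year, sale_date.month, product)] += amount
    return dict(aggregate)

def drill_down(rows, year, month, product):
    """Return the atomic rows behind one aggregate cell."""
    return [r for r in rows
            if r[0].year == year and r[0].month == month and r[1] == product]

monthly = build_monthly_aggregate(atomic_sales)
detail = drill_down(atomic_sales, 2024, 1, "Widget")
```

Had only the monthly totals been stored, the two January Widget sales in `detail` could not be recovered.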

Conformed Dimensions
  • By conforming your dimensions you can correlate performance across business processes.
  • Can be very painful (but worth it) if combining data from disparate systems.
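
A rough sketch of why conformed dimensions matter: two fact tables from different business processes can be compared only because they share the same dimension keys. The department dimension, the sample amounts, and the `drill_across` helper are assumptions for illustration.

```python
# Hypothetical conformed dimension shared by both business processes.
dim_department = {1: "Sales", 2: "Engineering"}

# Two fact tables from different processes: (department_key, amount).
fact_gl_transactions = [(1, 120.0), (1, 80.0), (2, 300.0)]
fact_budget = [(1, 250.0), (2, 280.0)]

def summarize(facts):
    """Total the amounts per department key."""
    totals = {}
    for department_key, amount in facts:
        totals[department_key] = totals.get(department_key, 0.0) + amount
    return totals

def drill_across(dim, actual_facts, budget_facts):
    """Line up results from both fact tables on the shared dimension."""
    actual = summarize(actual_facts)
    budget = summarize(budget_facts)
    return {
        name: {"actual": actual.get(key, 0.0), "budget": budget.get(key, 0.0)}
        for key, name in dim.items()
    }

report = drill_across(dim_department, fact_gl_transactions, fact_budget)
```

If each process kept its own incompatible department list, the join in `drill_across` would need a mapping step, which is the painful part the slide mentions.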

Use Surrogate Keys
  • Always use an artificial key as the primary key
  • Surrogate keys allow you to:
      • Protect your model from changes in the source system
      • Integrate data from multiple sources
      • Add rows that do not exist in the source system
      • Track changes to dimensions over time
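
One possible way to assign surrogate keys is sketched below. The `SurrogateKeyMap` class, the source system names, and the choice of 0 as the key for an unknown member are all assumptions, not something the deck prescribes.

```python
from itertools import count

class SurrogateKeyMap:
    """Maps (source_system, natural_key) pairs to warehouse-assigned integers."""

    UNKNOWN_KEY = 0  # a row that exists only in the warehouse, not in any source

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        """Return the existing surrogate key, or assign the next one."""
        if natural_key is None:
            return self.UNKNOWN_KEY
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = next(self._next_key)
        return self._keys[pair]

keys = SurrogateKeyMap()
crm_key = keys.key_for("crm", "C-100")
erp_key = keys.key_for("erp", "C-100")     # same natural key, different source
repeat_key = keys.key_for("crm", "C-100")  # stable on repeated lookups
missing_key = keys.key_for("crm", None)    # falls back to the unknown row
```

The two sources can reuse the same natural key without colliding, and a source system can renumber its own keys without touching the fact tables.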

Slowly Changing Dimensions
  • A key design consideration is what to do when dimension values change.
  • A change may or may not have business meaning.
  • Three ways to handle changes

Slowly Changing Dimension Types
  • Type I
      • Simply overwrite the old values.
      • Simplest case, used when you don’t care about changes to data.
  • Type II
      • Create a new dimension row for new values
      • Existing facts still relate to the old dimension value
      • Used when you do care about the historical changes.
  • Type III
      • Add a new column to the table to store the new value
      • Rarely used.
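
The three types can be sketched on in-memory dimension rows. The column names (`valid_from`, `valid_to`, `is_current`, `previous_…`) and the sample data are assumptions chosen for illustration. Real implementations vary, and Type III here overwrites the current column and keeps the prior value in the added column.

```python
from datetime import date

def apply_type1(rows, natural_key, attribute, new_value):
    """Type I: overwrite in place; the old value is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attribute] = new_value

def apply_type2(rows, natural_key, attribute, new_value, change_date):
    """Type II: expire the current row and add a new row with a new surrogate key."""
    current = next(r for r in rows
                   if r["natural_key"] == natural_key and r["is_current"])
    new_row = dict(current)
    new_row.update({
        attribute: new_value,
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })
    current["is_current"] = False
    current["valid_to"] = change_date
    rows.append(new_row)
    return new_row["surrogate_key"]

def apply_type3(rows, natural_key, attribute, new_value):
    """Type III: keep the prior value in an extra column."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attribute] = row[attribute]
            row[attribute] = new_value

# Type I: fix a misspelling; nobody needs the old value.
products = [{"surrogate_key": 1, "natural_key": "P-1", "name": "Wiget"}]
apply_type1(products, "P-1", "name", "Widget")

# Type II: a customer moves; facts loaded earlier keep pointing at key 1.
customers = [{"surrogate_key": 1, "natural_key": "C-100", "city": "Austin",
              "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
new_key = apply_type2(customers, "C-100", "city", "Dallas", date(2024, 6, 1))

# Type III: a territory is reassigned; only one prior value is kept.
reps = [{"surrogate_key": 1, "natural_key": "R-1", "territory": "East"}]
apply_type3(reps, "R-1", "territory", "West")
```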

Date Dimensions
  • Dates are a fundamental business concept, and nearly every DW has a date dimension
  • The date dimension is the classic role-playing dimension.
  • Allows rollups/filters on any date-related attribute such as month/quarter/year
  • Date dimension records still use a surrogate key to handle unknown dates.
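
A small date dimension might be generated as sketched below. The attribute set, the sequential key scheme, and the use of key 0 for unknown dates are assumptions. The `order` record shows role playing: two foreign keys pointing at the same dimension.

```python
from datetime import date, timedelta

UNKNOWN_DATE_KEY = 0  # assumed convention for "date not known yet"

def build_date_dimension(start, end):
    """Build one row per calendar day plus a row for unknown dates."""
    rows = {UNKNOWN_DATE_KEY: {"date": None, "year": None, "quarter": None, "month": None}}
    key_by_date = {}
    day, key = start, 1
    while day <= end:
        rows[key] = {
            "date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
        }
        key_by_date[day] = key
        day += timedelta(days=1)
        key += 1
    return rows, key_by_date

dim_date, key_by_date = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))

def date_key(value):
    """Look up the surrogate key, falling back to the unknown row."""
    return key_by_date.get(value, UNKNOWN_DATE_KEY)

# Role playing: the same dimension serves as both order date and ship date.
order = {
    "order_date_key": date_key(date(2024, 3, 15)),
    "ship_date_key": date_key(None),  # not shipped yet
}
order_quarter = dim_date[order["order_date_key"]]["quarter"]
ship_quarter = dim_date[order["ship_date_key"]]["quarter"]
```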

Snowflaking
  • Snowflaking is the process of hooking up lookup tables to a dimension.
  • This is, in a way, re-normalizing the data.
  • Snowflaking is in general discouraged since it adds complexity to the model.

Many to Many Relationships
  • Most relationships are one-to-many. This is the simplest case.
  • Real-world scenarios are often more complex.
  • Many-to-many relationships between facts & dimensions are represented by creating a bridge table between the facts and the dimension
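
A bridge table can be sketched as follows, using joint bank accounts as a hypothetical example. The weighting factor column is an assumption added here (it is a common convention, not something stated in the deck). It keeps totals from being double counted when a fact is shared by several dimension members.

```python
# Hypothetical customer dimension.
dim_customer = {1: "Ann", 2: "Bob", 3: "Cy"}

# Bridge rows: (customer_group_key, customer_key, weight).
# Weights within a group are assumed to sum to 1.
bridge_account_customer = [
    (100, 1, 0.5), (100, 2, 0.5),  # joint account held by Ann and Bob
    (200, 3, 1.0),
]

# Fact rows: (customer_group_key, balance).
fact_balances = [(100, 1000.0), (200, 400.0)]

def balance_by_customer(facts, bridge, dim):
    """Allocate each fact to its customers through the bridge table."""
    totals = {name: 0.0 for name in dim.values()}
    for group_key, balance in facts:
        for bridge_group, customer_key, weight in bridge:
            if bridge_group == group_key:
                totals[dim[customer_key]] += balance * weight
    return totals

allocated = balance_by_customer(fact_balances, bridge_account_customer, dim_customer)
```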

Hierarchies
  • Hierarchies summarize or group the data within the dimension.
  • Typically are denormalized into the dimension table
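
As a sketch, a product hierarchy denormalized into the dimension table means every level is a plain column on each row, so a rollup needs no extra lookup tables. The product names, the levels, and the `rollup` helper are hypothetical.

```python
# Hypothetical product dimension with the hierarchy stored as columns.
dim_product = {
    1: {"product": "Hex bolt", "subcategory": "Fasteners", "category": "Hardware"},
    2: {"product": "Wood screw", "subcategory": "Fasteners", "category": "Hardware"},
    3: {"product": "Claw hammer", "subcategory": "Hand tools", "category": "Hardware"},
}

# Fact rows: (product_key, amount).
fact_sales = [(1, 10.0), (2, 5.0), (3, 25.0), (1, 10.0)]

def rollup(facts, dim, level):
    """Total the facts at any level of the hierarchy."""
    totals = {}
    for product_key, amount in facts:
        label = dim[product_key][level]
        totals[label] = totals.get(label, 0.0) + amount
    return totals

by_subcategory = rollup(fact_sales, dim_product, "subcategory")
by_category = rollup(fact_sales, dim_product, "category")
```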

Types of Fact Tables
  • There are three types of fact tables
  • Transaction
      • Tracks each transaction as it occurs.
  • Periodic Snapshot
      • Captures cumulative performance over a specific period of time
      • Often used for periodic rollups
  • Accumulating Snapshot
      • Updated over time
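
The three fact table types can be contrasted in a short sketch. The order data, the monthly period, and the milestone names are assumptions for illustration only.

```python
from datetime import date

# Transaction fact: one row per event, inserted as it occurs.
transactions = [
    {"order_id": 1, "date": date(2024, 1, 3), "amount": 100.0},
    {"order_id": 2, "date": date(2024, 1, 17), "amount": 50.0},
    {"order_id": 3, "date": date(2024, 2, 2), "amount": 70.0},
]

def periodic_snapshot(rows):
    """Periodic snapshot: one row per period (here, per month) with a cumulative total."""
    snapshot = {}
    for row in rows:
        period = (row["date"].year, row["date"].month)
        snapshot[period] = snapshot.get(period, 0.0) + row["amount"]
    return snapshot

# Accumulating snapshot: one row per order, revisited as milestones occur.
order_pipeline = {
    1: {"ordered": date(2024, 1, 3), "shipped": None, "delivered": None},
}

def record_milestone(pipeline, order_id, milestone, when):
    """Update the existing row instead of inserting a new one."""
    pipeline[order_id][milestone] = when

monthly_totals = periodic_snapshot(transactions)
record_milestone(order_pipeline, 1, "shipped", date(2024, 1, 5))
```

Only the accumulating snapshot changes existing rows. The other two types are insert-only in this sketch.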


Editor's Notes

  • #4 These may seem simple, but these principles are the foundation for the design methodology. For business users to be able to navigate the system, the tools and, most importantly, the data must be simple and easy to use. Consistency requires a thorough ETL process to cleanse & conform the data. Change is inevitable. We need a design that is resilient to change. Security … Must have the right data in order to support decisions; this means up-front analysis focuses on the business need.
  • #5 Ultimately if any system doesn’t satisfy some business need, it is of no value and is a failure.
  • #9 Go through each component.
  • #10 Discuss each bullet point.
  • #11 Discuss each bullet point. Examples of cleansing activities: misspellings, formatting, capitalization conformance. Emphasize that users are forbidden from executing queries on these data. 3NF data is too complex for most users. 3NF is not optimized for query performance.
  • #12 Discuss each bullet point. Discuss what it means to be a conformed data mart. Point out that dimensional modeling will be discussed in detail later on.
  • #13 Discuss each bullet point. Specify the examples in this diagram and what role they play.
  • #15 Discuss why additive facts are most useful. Describe semi-additive facts. Note that the primary key is the combination of all the foreign keys. A ROWID adds little value, and the index probably would not be of any use either.
  • #16 Attribute descriptions should avoid cryptic abbreviations. Minimize the use of codes. Show the denormalized nature of one of these dimensions. Denormalized dimensions provide the following benefits: a simplified structure for non-technical users, and better query performance. Since dimensions typically have relatively few rows, the impact of reduced storage efficiency is minimal.
  • #23 Walk through SCD2 example using dimensional model above
  • #24 Walk through the date dimension in the POC example
  • #28 Point out that the POCGLTransaction fact table is a transaction fact table and the budget table is a periodic snapshot.