The Data Warehouse Lifecycle


  • These may seem simple, but these principles are the foundation for the design methodology. For business users to be able to navigate the system, the tools and, most importantly, the data must be simple and easy to use. Consistency requires a thorough ETL process to cleanse & conform the data. Change is inevitable; we need a design that is resilient to change. Security … Must have the right data in order to support decisions; this means up-front analysis focuses on the business need.
  • Ultimately if any system doesn’t satisfy some business need, it is of no value and is a failure.
  • Go through each component.
  • Discuss each bullet point.
  • Discuss each bullet point. Examples of cleansing activities: misspellings, formatting, capitalization conformance. Emphasize that users are forbidden from executing queries on these data. 3NF data is too complex for most users. 3NF is not optimized for query performance.
  • Discuss each bullet point. Discuss what it means to be a conformed data mart. Point out that dimensional modeling will be discussed in detail later on.
  • Discuss each bullet point. Specify the examples in this diagram and what role they play.
  • Discuss why additive facts are most useful. Describe semi-additive facts. Note that the primary key is the combination of all the foreign keys. A ROWID adds little value, and the index probably would not be of any use either.
  • Attribute descriptions should avoid cryptic abbreviations. Minimize the use of codes. Show the denormalized nature of one of these dimensions. Denormalized dimensions provide the following benefits: a simplified structure for non-technical users and better query performance. Since dimensions typically have relatively few rows, the impact of reduced storage efficiency is minimal.
  • Walk through SCD2 example using dimensional model above
  • Walk through the date dimension in the POC example
  • Point out that the POCGLTransaction fact table is a transaction fact table and the budget table is a periodic snapshot.
  • The Data Warehouse Lifecycle

    1. The Data Warehouse Lifecycle
        • Bart Lowe
        • Decision Source Inc.
    2. Agenda
        • Discuss high level concepts related to Data warehousing
    3. The Data Warehouse Must…
        • Make information easily accessible
        • Present information consistently
        • Be adaptive & resilient to change
        • Be Secure
        • Serve as the foundation for decision making
    4. The Business Community Must…
        • Accept and trust the data warehouse if it is to be successful
    5. Data Warehouse Lifecycle
    6. Data Warehouse Lifecycle
    7. Data Warehouse Components
    8. Data Warehouse Components
    9. Source Systems
        • Not optimized for reporting
        • Data schemas optimized for transactions, not queries
        • Difficult to share data
        • Typically do not maintain historical data
       Data Staging Area
        • Consists of staging storage & ETL processes
        • This is typically the most difficult and labor-intensive component
        • Data is cleansed & conformed
        • Typically a normalized data schema
        • No direct querying of this component is allowed
       Data Presentation Area
        • This is the data that is made available to users & analytical applications
        • Consists of a series of conformed dimensional data marts
        • Each data mart represents a different business process.
        • Dimensional modeling emphasizes simplicity & query performance.
       Business Intelligence Tools
        • These are the tools that are used to query the data
        • In general, only a small subset of users will need true ad-hoc query capability
        • 80-90% of users will use a parameterized analytic system
       Dimensional Modeling Vocabulary
    22. Fact Table
        • This is the primary table in a dimensional model
        • The measurements of the dimensional model are stored here
        • Each measurement is tracked at the intersection of several dimensions
        • This is the “grain” of the model
        • Most useful facts are additive
    23. Dimension Table
        • Descriptors of each fact
        • Tend to have many attributes but fewer rows
        • Tend to be used as query constraints.
        • The better the attribute descriptions, the better the warehouse
        • Typically highly denormalized
    24. Star Schema
        • This is a fact table joined to a set of dimensions
        • Relates data in a manner that is familiar to business users.
        • Symmetrical nature allows for answering many different business questions
        • One dimensional model will exist for each business process.
        • A single data warehouse can have dozens of these models.
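The vocabulary above can be made concrete with a small sketch. It uses Python's built-in `sqlite3` module and an in-memory database; the table names (`fact_sales`, `dim_date`, `dim_product`), the columns, and the sample rows are all invented for illustration and do not come from the slides.

```python
import sqlite3


def build_star_schema() -> sqlite3.Connection:
    """Create a tiny star schema: one fact table joined to two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
        -- Grain: one row per product per day. The primary key is the
        -- combination of the dimension foreign keys.
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            sales_amount REAL,   -- additive fact: safe to SUM over any dimension
            PRIMARY KEY (date_key, product_key));
        INSERT INTO dim_date VALUES
            (1, '2024-01-15', '2024-01'),
            (2, '2024-01-16', '2024-01'),
            (3, '2024-02-01', '2024-02');
        INSERT INTO dim_product VALUES
            (1, 'Widget', 'Hardware'),
            (2, 'Gadget', 'Hardware'),
            (3, 'Manual', 'Books');
        INSERT INTO fact_sales VALUES
            (1, 1, 2, 20.0), (1, 3, 1, 5.0), (2, 2, 1, 30.0), (3, 1, 4, 40.0);
    """)
    return conn


def sales_by_category_and_month(conn: sqlite3.Connection) -> list:
    """Constrain and group by dimension attributes, sum the additive fact."""
    return conn.execute("""
        SELECT p.category, d.month, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_key = f.date_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """).fetchall()
```

Any other dimension attribute could be swapped into the `GROUP BY` without changing the shape of the query, which is the symmetry the Star Schema slide refers to.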
    25. Dimensional Modeling Key Concepts
    26. Store the most atomic data
        • By storing the most detailed data possible, you can ensure that users can drill to the level they need.
        • It's OK to provide aggregate facts as well to improve performance.
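A minimal sketch of why atomic data matters, assuming made-up sales rows: an aggregate can always be derived from the atomic rows, and the atomic rows remain available for drill-down. The function names and data are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Atomic fact rows: (sale date, product, amount). Sample data is invented.
ATOMIC_SALES = [
    (date(2024, 1, 15), "Widget", 20.0),
    (date(2024, 1, 16), "Widget", 30.0),
    (date(2024, 2, 1), "Widget", 40.0),
    (date(2024, 2, 1), "Gadget", 5.0),
]


def monthly_aggregate(rows):
    """Pre-aggregate atomic rows to (year, month, product) totals."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        totals[(day.year, day.month, product)] += amount
    return dict(totals)


def drill_down(rows, year, month, product):
    """Return the atomic rows behind a single aggregate cell."""
    return [
        row for row in rows
        if row[0].year == year and row[0].month == month and row[1] == product
    ]
```

The reverse is not possible: if only the monthly totals had been loaded, the two January Widget sales could not be recovered.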
    27. Conformed Dimensions
        • By conforming your dimensions you can correlate performance across business processes.
        • Can be very painful (but worth it) if combining data from disparate systems.
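As a rough illustration of correlating across business processes, the sketch below assumes two hypothetical fact tables (sales and returns) that share one conformed product dimension. Because both use the same keys and the same `category` attribute, their totals can be lined up side by side. All names and figures are invented.

```python
# Conformed dimension shared by both business processes (invented data).
DIM_PRODUCT = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Manual", "category": "Books"},
}

# Two fact tables from different business processes: (product_key, amount).
FACT_SALES = [(1, 100.0), (2, 20.0), (1, 50.0)]
FACT_RETURNS = [(1, 10.0)]


def total_by_category(fact_rows):
    """Sum a fact by the conformed 'category' attribute."""
    totals = {}
    for product_key, amount in fact_rows:
        category = DIM_PRODUCT[product_key]["category"]
        totals[category] = totals.get(category, 0.0) + amount
    return totals


def drill_across():
    """Line up sales and returns per category: {category: (sales, returns)}."""
    sales = total_by_category(FACT_SALES)
    returns = total_by_category(FACT_RETURNS)
    categories = sorted(set(sales) | set(returns))
    return {c: (sales.get(c, 0.0), returns.get(c, 0.0)) for c in categories}
```

If each process had its own product table with different keys or category spellings, the final merge would not line up, which is where the painful conforming work comes in.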
    28. Use Surrogate Keys
        • Always use an artificial key as the primary key
        • Surrogate keys allow you to:
            • Protect your model from changes in the source system
            • Integrate data from multiple sources
            • Add rows that do not exist in the source system.
            • Track changes to dimensions over time.
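One way the surrogate key idea might look in an ETL step is sketched below. The `SurrogateKeyMap` class, the source system names, and the choice of 0 for the "unknown" row are assumptions made for this example, not something the slides specify.

```python
from itertools import count


class SurrogateKeyMap:
    """Assign warehouse-owned integer keys to (source system, natural key) pairs."""

    UNKNOWN_KEY = 0  # assumed convention: a row that exists in no source system

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        """Return the existing surrogate key, or allocate the next one."""
        if natural_key is None:
            return self.UNKNOWN_KEY
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = next(self._next_key)
        return self._keys[pair]
```

Here the same natural key `"C-17"` arriving from two hypothetical sources (`"crm"` and `"erp"`) gets two different surrogate keys, so the sources can be integrated without collisions, and a source that later renumbers its records only requires updating the map.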
    29. Slowly Changing Dimensions
        • A key design consideration is what to do when dimension values change.
        • A change may or may not have business meaning.
        • Three ways to handle changes
    30. Slowly Changing Dimension Types
        • Type I
            • Simply overwrite the old values.
            • Simplest case, used when you don’t care about changes to data.
        • Type II
            • Create a new dimension row for new values
            • Existing facts still relate to old dimension value
            • Used when you do care about the historical changes.
        • Type III
            • Add a new column to table to store the new value
            • Rarely used.
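The three types can be sketched against a dimension held as a list of dictionaries. The column names (`surrogate_key`, `natural_key`, `valid_from`, `valid_to`, `is_current`, and the `previous_` prefix) follow a common convention but are assumptions here; the slides do not name them.

```python
from datetime import date


def apply_type1(rows, natural_key, attr, new_value):
    """Type I: overwrite the value in place; the old value is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attr] = new_value


def apply_type2(rows, natural_key, attr, new_value, change_date):
    """Type II: expire the current row and add a new row with a new surrogate key."""
    current = next(
        r for r in rows if r["natural_key"] == natural_key and r["is_current"]
    )
    current["is_current"] = False
    current["valid_to"] = change_date
    new_row = dict(current)
    new_row[attr] = new_value
    new_row.update(
        surrogate_key=max(r["surrogate_key"] for r in rows) + 1,
        valid_from=change_date,
        valid_to=None,
        is_current=True,
    )
    rows.append(new_row)
    return new_row["surrogate_key"]


def apply_type3(rows, natural_key, attr, new_value):
    """Type III: keep the prior value in an extra column on the same row."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value
```

With Type II, facts loaded before the change keep pointing at the old surrogate key, so historical reports still show the old attribute value, while new facts pick up the key returned by `apply_type2`.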
    31. Date Dimensions
        • Dates are a fundamental Business concept and nearly every DW has a date dimension
        • The date dimension is the classic role playing dimension.
        • Allows rollups/filters on any date related attribute such as month/quarter/year
        • Date dimension records still use a surrogate to handle unknown dates.
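A date dimension is usually generated rather than loaded from a source. The sketch below shows one possible generator; the column names and the use of key 0 for the unknown date are assumptions for illustration.

```python
from datetime import date, timedelta

UNKNOWN_DATE_KEY = 0  # assumed convention for facts whose date is not known


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, plus an 'unknown' row."""
    rows = [{
        "date_key": UNKNOWN_DATE_KEY, "full_date": None,
        "year": None, "quarter": None, "month": None,
    }]
    day = start
    key = 1
    while day <= end:
        rows.append({
            "date_key": key,                      # surrogate key, not the date itself
            "full_date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,  # 1..4
            "month": day.month,
        })
        day += timedelta(days=1)
        key += 1
    return rows
```

Role playing means a single fact row can reference this one table more than once, for example through hypothetical `order_date_key` and `ship_date_key` columns, with a not-yet-shipped order pointing its ship date at `UNKNOWN_DATE_KEY`.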
    32. Snowflaking
        • Snowflaking is the process of hooking up lookup tables to a dimension.
        • This is in a way re-normalizing the data.
        • Snowflaking is in general discouraged since it adds complexity to the model.
    33. Many to Many Relationships
        • Most relationships are one-to-many. This is the simplest case.
        • Real world scenarios are often more complex.
        • Many to Many between facts & dimensions are represented by creating a bridge table between the facts and the dimension
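A bridge table might look like the following sketch, again using the built-in `sqlite3` module. The scenario (a sale credited to a group of sales reps), the table names, and especially the `weight` column are assumptions; a weighting factor is one common way to avoid double counting, but the slides do not mention it.

```python
import sqlite3


def build_bridge_example() -> sqlite3.Connection:
    """One sale can belong to several reps via a rep-group bridge table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_sales_rep (rep_key INTEGER PRIMARY KEY, rep_name TEXT);
        -- Bridge: each group lists its member reps; weights in a group sum to 1.
        CREATE TABLE bridge_rep_group (
            rep_group_key INTEGER,
            rep_key INTEGER REFERENCES dim_sales_rep(rep_key),
            weight REAL,
            PRIMARY KEY (rep_group_key, rep_key));
        CREATE TABLE fact_sales (
            sale_id INTEGER PRIMARY KEY,
            rep_group_key INTEGER,
            sales_amount REAL);
        INSERT INTO dim_sales_rep VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO bridge_rep_group VALUES (1, 1, 0.5), (1, 2, 0.5), (2, 1, 1.0);
        INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 2, 60.0);
    """)
    return conn


def weighted_sales_by_rep(conn: sqlite3.Connection) -> list:
    """Allocate each sale across its reps so the grand total is preserved."""
    return conn.execute("""
        SELECT r.rep_name, SUM(f.sales_amount * b.weight)
        FROM fact_sales AS f
        JOIN bridge_rep_group AS b ON b.rep_group_key = f.rep_group_key
        JOIN dim_sales_rep AS r ON r.rep_key = b.rep_key
        GROUP BY r.rep_name
        ORDER BY r.rep_name
    """).fetchall()
```

Without the weight, the 100.0 sale shared by Ann and Bob would be counted in full for each of them and the per-rep totals would add up to more than the actual sales.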
    34. Hierarchies
        • Hierarchies summarize or group the data within the dimension.
        • Typically are de-normalized into the dimension table
    35. Types of Fact Tables
        • There are three types of fact tables
        • Transaction
            • Tracks each transaction as it occurs.
        • Periodic Snapshot
            • Captures cumulative performance over a specific period of time
            • Often used for periodic rollups
        • Accumulating Snapshot
            • Updated over time
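The difference between the three types can be sketched with plain Python structures. The account and order data, the field names, and the milestone columns are all hypothetical.

```python
from collections import defaultdict

# Transaction fact: one row per event, as it occurs (invented sample data).
TRANSACTIONS = [
    ("2024-01", "acct-1", 100.0),
    ("2024-01", "acct-1", -30.0),
    ("2024-02", "acct-1", 50.0),
]


def periodic_snapshot(transactions):
    """Periodic snapshot: one row per (period, account) with the period's net activity."""
    totals = defaultdict(float)
    for period, account, amount in transactions:
        totals[(period, account)] += amount
    return dict(totals)


def new_order_row(order_id):
    """Accumulating snapshot: one row per order, with a slot for each milestone."""
    return {"order_id": order_id, "ordered_on": None,
            "shipped_on": None, "delivered_on": None}


def record_milestone(row, milestone, when):
    """Update the existing row in place as the order moves through its lifecycle."""
    if milestone == "order_id" or milestone not in row:
        raise KeyError(milestone)
    row[milestone] = when
    return row
```

Transaction and periodic snapshot rows are only ever inserted, while an accumulating snapshot row is revisited and updated each time a milestone is reached.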
    36. Physical Design Considerations
    37. System Sizing Considerations
        • Storage
            • The fact table volumes will drive storage requirements.
            • Don’t forget to account for staging storage needs.
        • Performance
            • Understand the usage complexity of your community.
            • Predefined reports & queries can be cached or pre-aggregated.
            • The more ad-hoc analysis is used, the greater the impact on hardware requirements.
            • Must understand how many simultaneous users the DW will be asked to support.
        • Memory
            • All the BI components love RAM.
            • Use 64-bit hardware to address more memory space.
    38. System Configuration Considerations
        • All-In-One Configuration
        • All components hosted on a single server
        • Appropriate for small deployments or POCs
    39. System Configuration Considerations
        • Separate Reporting Server
        • Reporting Server scaled out
        • Appropriate for mid-sized deployments
    40. System Configuration Considerations
        • Scale Out Deployment
        • Both Report Services & Analysis Services have their own servers
        • Appropriate for larger deployments
        • Can be scaled massively from here
    41. Common Pitfalls to Avoid
    42. Common Pitfalls
        • Becoming overly focused on technology rather than business requirements & goals
        • Failing to embrace an influential management visionary as the business sponsor
        • Tackling a huge multiyear project rather than smaller iterative development efforts
        • Paying more attention to back-end issues and ease of development than to front-end performance and simplicity
    43. Common Pitfalls Continued…
        • Making the queryable data overly complex
        • Populating the model without properly conforming your dimensions
        • Loading only summary data into your models
        • Presuming that the business requirements are static
        • Neglecting to understand that data warehouse success is tied to user acceptance