
BG SQL&BI User group meeting - DWH modeling - the basics

  1. Data warehouse modeling – the basics. SQL&BI UG Meeting, Oct 2019
  2. What Is a Data Warehouse? A centralized store of business data for reporting and analysis that typically:
     • Contains large volumes of historical data
     • Is optimized for querying, as opposed to inserting or updating data
     • Is incrementally loaded with new business data at regular intervals
     • Provides the basis for enterprise BI solutions
  3. Data Warehouse Architectures: Central Data Warehouse • Departmental Data Marts • Hub-and-Spoke
  4. Components of a Data Warehousing Solution: Data Sources • ETL and Data Cleansing • Data Warehouse • Data Models • Reporting and Analysis • MDM
  5. The Dimensional Model
  6. The Dimensional Model
     [Diagram: a star schema, in which dimension tables (attributes) link directly to a central fact table (measures), contrasted with a snowflake schema, in which some dimension tables link to other dimension tables rather than directly to the fact table.]
  7. Star Schema
  8. Star Schema in Power BI Models
  9. The Data Warehouse Design Process
     1. Determine analytical and reporting requirements
     2. Identify the business processes that generate the required data
     3. Examine the source data for those business processes
     4. Conform dimensions across business processes
     5. Prioritize processes and create a dimensional model for each
     6. Document and refine the models to determine the database logical schema
     7. Design the physical data structures for the database
  10. Dimensional Modeling
     [Bus matrix: business processes (Manufacturing, Order Processing, Order Fulfillment, Financial Accounting, Inventory Management) mapped against shared dimensions (Time, Product, Customer, Salesperson, Factory Line, Shipper, Account, Department, Warehouse).]
     Example – Order Processing:
     • Grain: 1 row per order item
     • Dimensions: Time (order date and ship date), Product, Customer, Salesperson
     • Facts: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
  11. Documenting Dimensional Models
     [Diagram: a Sales Order fact table (Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost) linked to four dimensions – Time (Order Date and Ship Date), with Calendar Year > Month > Date and Fiscal Year > Fiscal Quarter > Month > Date hierarchies; Product (Category > Subcategory > Product Name, plus Color and Size); Customer (Country > State or Province > City, plus Age, Marital Status, Gender); and Salesperson (Region > Country > Territory, plus Manager, Surname, Forename).]
  12. Lesson 2: Designing Dimension Tables
     • Considerations for Dimension Keys
     • Dimension Attributes and Hierarchies
     • Unknown and None
     • Designing Slowly Changing Dimensions
     • Time Dimension Tables
     • Self-Referencing Dimension Tables
     • Junk Dimensions
  13. Considerations for Dimension Keys
     Surrogate key vs. business (alternate) key:
     ProductKey | ProductAltKey | ProductName | Color | Size
     1 | MB1-B-32 | MB1 Mountain Bike | Blue | 32
     2 | MB1-R-32 | MB1 Mountain Bike | Red | 32
     CustomerKey | CustomerAltKey | Name
     1 | 1002 | Amy Alberts
     2 | 1005 | Neil Black
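A minimal T-SQL sketch of this pattern, with table and column names mirroring the slide (the schema is illustrative, not the course's official DDL): the surrogate key is generated inside the warehouse, while the alternate key preserves the source system's business key.

```sql
-- Illustrative dimension table: ProductKey is the warehouse-generated
-- surrogate key; ProductAltKey holds the source system's business key.
CREATE TABLE dbo.DimProduct
(
    ProductKey    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ProductAltKey NVARCHAR(25)      NOT NULL,              -- business (alternate) key
    ProductName   NVARCHAR(100)     NOT NULL,
    Color         NVARCHAR(20)      NULL,
    Size          INT               NULL
);
```

Keeping the business key alongside the surrogate key lets the ETL process match incoming source rows to existing dimension members without exposing source keys to the fact tables.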
  14. Dimension Attributes and Hierarchies
     CustKey | CustAltKey | Name | Country | State | City | Phone | Gender
     1 | 1002 | Amy Alberts | Canada | BC | Vancouver | 555 123 | F
     2 | 1005 | Neil Black | USA | CA | Irvine | 555 321 | M
     3 | 1006 | Ye Xu | USA | NY | New York | 555 222 | M
     Attribute roles: hierarchy (Country > State > City), slicer (e.g. Gender), drill-through detail (e.g. Phone)
  15. Unknown and None
     • Identify the semantic meaning of NULL: Unknown or None?
     • Do not assume NULL equality
     • Use ISNULL( )
     Source:
     OrderNo | Discount | DiscountType
     1000 | 1.20 | Bulk Discount
     1001 | 0.00 | N/A
     1002 | 2.00 | (NULL)
     1003 | 0.50 | Promotion
     1004 | 2.50 | Other
     1005 | 0.00 | N/A
     1006 | 1.50 | (NULL)
     Dimension Table:
     DiscKey | DiscAltKey | DiscountType
     -1 | Unknown | Unknown
     0 | N/A | None
     1 | Bulk Discount | Bulk Discount
     2 | Promotion | Promotion
     3 | Other | Other
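A sketch of the ISNULL technique the slide recommends, assuming a hypothetical staging table stg.Orders and the dimension table above: because NULL = NULL never evaluates to true, NULLs in the source are first coerced to the 'Unknown' member's alternate key.

```sql
-- Map source DiscountType values to dimension surrogate keys. NULL in the
-- source is treated as 'Unknown' rather than relying on NULL equality.
SELECT o.OrderNo,
       d.DiscKey
FROM   stg.Orders    AS o
JOIN   dbo.DimDiscount AS d
  ON   ISNULL(o.DiscountType, 'Unknown') = d.DiscAltKey;
```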
  16. Designing Slowly Changing Dimensions
     Type 1 (overwrite the attribute in place):
     Before: CustKey 1 | CustAltKey 1002 | Amy Alberts | Phone 555 123
     After:  CustKey 1 | CustAltKey 1002 | Amy Alberts | Phone 555 222
     Type 2 (add a new row, with current flag and start/end dates):
     Before: CustKey 1 | 1002 | Amy Alberts | Vancouver | Current: Yes | Start: 1/1/2000 | End: –
     After:  CustKey 1 | 1002 | Amy Alberts | Vancouver | Current: No  | Start: 1/1/2000 | End: 1/1/2012
             CustKey 4 | 1002 | Amy Alberts | Toronto   | Current: Yes | Start: 1/1/2012 | End: –
     Type 3 (keep prior and current values in separate columns):
     Before: CustKey 1 | 1002 | Amy Alberts | Cars: 0
     After:  CustKey 1 | 1002 | Amy Alberts | Prior Cars: 0 | Current Cars: 1
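A minimal T-SQL sketch of the Type 2 pattern, using the slide's customer example (the variable values mirror the slide; in a real load they would come from the source feed): expire the current version, then insert the new one.

```sql
-- Illustrative Type 2 SCD change: Amy Alberts moves from Vancouver to Toronto.
DECLARE @CustAltKey NVARCHAR(10) = N'1002',
        @NewCity    NVARCHAR(50) = N'Toronto',
        @LoadDate   DATE         = '2012-01-01';

-- Step 1: expire the current row.
UPDATE dbo.DimCustomer
SET    [Current] = 'No',
       [End]     = @LoadDate
WHERE  CustAltKey = @CustAltKey
  AND  [Current]  = 'Yes';

-- Step 2: insert the new version with a fresh surrogate key.
INSERT INTO dbo.DimCustomer (CustAltKey, Name, City, [Current], [Start])
VALUES (@CustAltKey, N'Amy Alberts', @NewCity, 'Yes', @LoadDate);
```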
  17. Time Dimension Tables
     • Surrogate key
     • Granularity
     • Range
     • Attributes and hierarchies
     • Multiple calendars
     • Unknown values
     DateKey | DateAltKey | MonthDay | WeekDay | Day | MonthNo | Month | Year
     00000000 | 01-01-1753 | NULL | NULL | NULL | NULL | NULL | NULL
     20130101 | 01-01-2013 | 1 | 3 | Tue | 01 | Jan | 2013
     20130102 | 01-02-2013 | 2 | 4 | Wed | 01 | Jan | 2013
     20130103 | 01-03-2013 | 3 | 5 | Thu | 01 | Jan | 2013
     20130104 | 01-04-2013 | 4 | 6 | Fri | 01 | Jan | 2013
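A sketch of how such a table might be populated, assuming a dbo.DimDate table shaped like the slide: a recursive CTE generates one row per day. Note that DATEPART(WEEKDAY, ...) depends on the session's DATEFIRST setting; the slide's numbering (Sunday = 1) is assumed here.

```sql
-- Generate one row per day for 2013; range and column names are illustrative.
;WITH d AS
(
    SELECT CAST('2013-01-01' AS DATE) AS dt
    UNION ALL
    SELECT DATEADD(DAY, 1, dt) FROM d WHERE dt < '2013-12-31'
)
INSERT INTO dbo.DimDate (DateKey, DateAltKey, MonthDay, [WeekDay], [Day], MonthNo, [Month], [Year])
SELECT CONVERT(INT, FORMAT(dt, 'yyyyMMdd')),   -- surrogate key, e.g. 20130101
       dt,
       DAY(dt),
       DATEPART(WEEKDAY, dt),                  -- assumes DATEFIRST 7 (Sunday = 1)
       LEFT(DATENAME(WEEKDAY, dt), 3),         -- 'Tue', 'Wed', ...
       MONTH(dt),
       LEFT(DATENAME(MONTH, dt), 3),           -- 'Jan', 'Feb', ...
       YEAR(dt)
FROM   d
OPTION (MAXRECURSION 366);
```

A separate row with DateKey 0 (shown as 00000000 on the slide) serves as the "unknown" member for facts whose date is not yet known.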
  18. Self-Referencing Dimension Tables
     EmployeeKey | EmployeeAltKey | EmployeeName | ManagerKey
     1 | 1000 | Kim Abercrombie | NULL
     2 | 1001 | Kamil Amireh | 1
     3 | 1002 | Cesar Garcia | 1
     4 | 1003 | Jeff Hay | 2
     [Org chart: Kim Abercrombie (1st level, manager) > Kamil Amireh and Cesar Garcia (2nd level, managers) > Jeff Hay (3rd level, employee, reporting to Kamil Amireh).]
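A sketch of how the hierarchy in such a table can be traversed, assuming a dbo.DimEmployee table shaped like the slide: a recursive CTE walks from the top-level manager (ManagerKey IS NULL) down through each reporting level.

```sql
-- Walk the self-referencing dimension top-down, tracking each row's level.
WITH OrgChart AS
(
    SELECT EmployeeKey, EmployeeName, ManagerKey, 1 AS OrgLevel
    FROM   dbo.DimEmployee
    WHERE  ManagerKey IS NULL                    -- top of the hierarchy
    UNION ALL
    SELECT e.EmployeeKey, e.EmployeeName, e.ManagerKey, o.OrgLevel + 1
    FROM   dbo.DimEmployee AS e
    JOIN   OrgChart        AS o ON e.ManagerKey = o.EmployeeKey
)
SELECT EmployeeName, OrgLevel
FROM   OrgChart
ORDER BY OrgLevel, EmployeeName;
```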
  19. Junk Dimensions
     • Combine low-cardinality attributes that don’t belong in existing dimensions into a junk dimension
     • Avoids creating many small dimension tables
     JunkKey | OutOfStockFlag | FreeShippingFlag | CreditOrDebit
     1 | 1 | 1 | Credit
     2 | 1 | 1 | Debit
     3 | 1 | 0 | Credit
     4 | 1 | 0 | Debit
     5 | 0 | 1 | Credit
     6 | 0 | 1 | Debit
     7 | 0 | 0 | Credit
     8 | 0 | 0 | Debit
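Because each attribute has only a handful of values, the full junk dimension can be generated once up front as the cross product of those values. A sketch, assuming a hypothetical dbo.DimOrderFlags table with an identity JunkKey:

```sql
-- Enumerate every combination of the low-cardinality flags, producing the
-- eight-row junk dimension shown on the slide.
INSERT INTO dbo.DimOrderFlags (OutOfStockFlag, FreeShippingFlag, CreditOrDebit)
SELECT s.Flag, f.Flag, c.PayType
FROM       (VALUES (1), (0))                 AS s(Flag)
CROSS JOIN (VALUES (1), (0))                 AS f(Flag)
CROSS JOIN (VALUES ('Credit'), ('Debit'))    AS c(PayType);
```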
  20. Fact Tables: Fact Table Columns • Types of Measure • Types of Fact Table
  21. Fact Table Columns
     • Dimension Keys
     • Measures
     • Degenerate Dimensions
     OrderDateKey | ProductKey | CustomerKey | OrderNo | Qty | SalesAmount
     20160101 | 25 | 120 | 1000 | 1 | 350.99
     20160101 | 99 | 120 | 1000 | 2 | 6.98
     20160101 | 25 | 178 | 1001 | 2 | 701.98
     (One table illustrates all three column types: OrderDateKey, ProductKey, and CustomerKey are dimension keys; Qty and SalesAmount are measures; OrderNo is a degenerate dimension.)
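A sketch of the fact table on the slide as DDL (names mirror the slide; the schema is illustrative): dimension keys reference the dimension tables' surrogate keys, and OrderNo is stored directly in the fact table as a degenerate dimension rather than in a separate dimension table.

```sql
CREATE TABLE dbo.FactSales
(
    OrderDateKey INT           NOT NULL,   -- dimension key (date dimension)
    ProductKey   INT           NOT NULL,   -- dimension key (DimProduct)
    CustomerKey  INT           NOT NULL,   -- dimension key (DimCustomer)
    OrderNo      INT           NOT NULL,   -- degenerate dimension
    Qty          INT           NOT NULL,   -- measure
    SalesAmount  DECIMAL(10,2) NOT NULL    -- measure
);
```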
  22. Types of Measure
     • Additive (e.g. SalesAmount):
     OrderDateKey | ProductKey | CustomerKey | SalesAmount
     20160101 | 25 | 120 | 350.99
     20160101 | 99 | 120 | 6.98
     20160102 | 25 | 178 | 701.98
     • Semi-Additive (e.g. StockCount):
     DateKey | ProductKey | StockCount
     20160101 | 25 | 23
     20160101 | 99 | 118
     20160102 | 25 | 22
     • Non-Additive (e.g. ProfitMargin):
     OrderDateKey | ProductKey | CustomerKey | ProfitMargin
     20160101 | 25 | 120 | 25
     20160101 | 99 | 120 | 22
     20160102 | 25 | 178 | 27
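A sketch of the semi-additive case, assuming a dbo.FactStock table shaped like the middle table above: SalesAmount can be summed across any dimension, but summing StockCount across dates is meaningless, so a query instead takes each product's most recent (closing) count.

```sql
-- Closing stock per product: the latest snapshot row, not a SUM over dates.
SELECT ProductKey, StockCount AS ClosingStock
FROM (
    SELECT ProductKey, StockCount,
           ROW_NUMBER() OVER (PARTITION BY ProductKey
                              ORDER BY DateKey DESC) AS rn
    FROM dbo.FactStock
) AS latest
WHERE rn = 1;
```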
  23. Types of Fact Table
     • Transaction Fact Table:
     OrderDateKey | ProductKey | CustomerKey | OrderNo | Qty | Cost | SalesAmount
     20160101 | 25 | 120 | 1000 | 1 | 125.00 | 350.99
     20160101 | 99 | 120 | 1000 | 2 | 2.50 | 6.98
     20160101 | 25 | 178 | 1001 | 2 | 250.00 | 701.98
     • Periodic Snapshot Fact Table:
     DateKey | ProductKey | OpeningStock | UnitsIn | UnitsOut | ClosingStock
     20160101 | 25 | 25 | 1 | 3 | 23
     20160101 | 99 | 120 | 0 | 2 | 118
     • Accumulating Snapshot Fact Table:
     OrderNo | OrderDateKey | ShipDateKey | DeliveryDateKey
     1000 | 20160101 | 20160102 | 20160105
     1001 | 20160101 | 20160102 | 00000000
     1002 | 20160102 | 00000000 | 00000000
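Unlike the other two types, an accumulating snapshot row is updated as milestones occur. A sketch, assuming a hypothetical dbo.FactOrderFulfilment table shaped like the third table above, where the placeholder key 0 (shown as 00000000 on the slide) points at the "unknown" date row:

```sql
-- Order 1002 has shipped: record the ship date against the existing row.
UPDATE dbo.FactOrderFulfilment
SET    ShipDateKey = 20160102
WHERE  OrderNo     = 1002
  AND  ShipDateKey = 0;           -- milestone not yet recorded
```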
  24. Physical Implementation Specifics
  25. Considerations for Physical Implementation
     • Data files and filegroups
     • Staging tables
     • tempdb
     • Transaction logs
     • Backup files
     • Partitioning
     • Indexing
  26. Considerations for Indexes
     • Dimension table indexes:
       • Clustered index on the surrogate key column
       • Nonclustered index on the business key and SCD columns
       • Nonclustered indexes on frequently searched columns
     • Fact table indexes:
       • Clustered index on the most commonly searched date key, with nonclustered indexes on the other dimension keys
       or
       • Columnstore index on all columns
     • A composite index key comprises up to 16 columns
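A sketch of the two fact table options on the slide, using the illustrative dbo.FactSales table from the earlier sketch (index names are hypothetical). The two approaches are alternatives: a clustered columnstore index replaces the rowstore clustered index entirely.

```sql
-- Option A: rowstore; cluster on the most commonly searched date key.
CREATE CLUSTERED INDEX CIX_FactSales_OrderDateKey
    ON dbo.FactSales (OrderDateKey);
CREATE NONCLUSTERED INDEX IX_FactSales_ProductKey
    ON dbo.FactSales (ProductKey);

-- Option B: a clustered columnstore index covering all columns,
-- typically preferred for large analytic fact tables.
-- CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;
```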
  27. Incremental Load Flows
  28. Options for Extracting Modified Data
     • Extract all records
     • Store a primary key and checksum
     • Use a datetime column as a “high water mark”
     • Use Change Data Capture
     • Use Change Tracking
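A minimal sketch of the "high water mark" option, assuming hypothetical src.Orders, stg.Orders, and etl.ExtractLog tables: only rows modified since the last successful extraction are pulled, and the stored mark is then advanced.

```sql
-- Read the mark recorded by the previous load.
DECLARE @HighWaterMark DATETIME2 =
    (SELECT LastExtractDate FROM etl.ExtractLog WHERE SourceTable = N'Orders');

-- Extract only rows modified since then.
INSERT INTO stg.Orders (OrderNo, CustomerID, OrderDate, ModifiedDate)
SELECT OrderNo, CustomerID, OrderDate, ModifiedDate
FROM   src.Orders
WHERE  ModifiedDate > @HighWaterMark;

-- Advance the mark for the next load.
UPDATE etl.ExtractLog
SET    LastExtractDate = (SELECT MAX(ModifiedDate) FROM stg.Orders)
WHERE  SourceTable = N'Orders';
```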
  29. Common ETL Data Flow Architectures
     • Single-stage ETL: data is transferred directly from source to data warehouse; transformations and validations occur in-flight or on extraction
     • Two-stage ETL: data is staged for a coordinated load; transformations and validations occur in-flight or on staged data
     • Three-stage ETL: data is extracted quickly to a landing zone, and then staged prior to loading; transformations and validation can occur throughout the data flow
     [Diagram: Source → DW; Source → Staging → DW; Source → Landing Zone → Staging → DW]

Editor's Notes

  • Explain that this course uses a fairly generic definition for a data warehouse. There are many very specific definitions in use throughout the data warehousing industry; students might be aware of some of the more common schools of thought on database design, including those of Bill Inmon and Ralph Kimball. This course does not advocate one approach over another, although the lab solutions and data warehouse schema design discussed here are more in line with a Kimball-based approach than any other.
  • Encourage students to consider the benefits of each architecture discussed in this topic:
    A centralized data warehouse provides a single source for all business analysis, reporting, and decision-making. However, designing and building such a comprehensive solution presents some significant challenges in devising a single schema that meets the needs of every business unit, and can result in huge volumes of data being transferred into the data warehouse from various source systems.
    A departmental data warehouse is more focused on the needs of a specific set of users, but might not hold a complete business-wide view of all key data. You might end up with multiple small data marts, each dealing with a discrete part of the business. These will improve reporting, analysis, and decision-making within individual departments but will not necessarily make it easier to get an overall view of the business.
    In some ways, a hub-and-spoke architecture offers the best aspects of the centralized data warehouse and departmental data mart approaches. However, it can be difficult to design and build—and synchronizing data between the hub and various spokes can present its own challenges.
    For a more detailed discussion of these architectures, see “Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation” at http://go.microsoft.com/fwlink/?LinkID=248853
  • Note that not all data warehousing solutions include every component shown on the slide. However, each component has an important part to play in the implementation of a data warehousing solution, and will be considered later in this course.
  • If students want to better understand how the star join query optimizations in the SQL Server query optimizer work, they should review the articles referenced in their notes. However, emphasize that the optimizations are automatic and that students do not need to do anything, other than use a star schema for the data warehouse tables, to benefit from them.
  • Note the reference to “The Microsoft Data Warehouse Toolkit” in the student notes. The principles described in this book underpin most of the generally accepted best practices for designing data warehouses with SQL Server. Therefore, this book is highly recommended reading for any instructor teaching this course.
  • Question
    What is the difference between a snowflake schema and a star schema?
    Answer
    In a star schema, every dimension table joins directly to the fact table. In a snowflake schema, some dimension tables join to other dimension tables, which in turn join to the fact table.
  • Use this topic to ensure that all students understand why the business key from the source system is not used as a unique key in dimension tables.
  • Emphasize that the categorization of attributes in this topic is simply used to help identify reasons why a data value would be included as a dimension attribute column. You do not need to apply any specific configuration to define an attribute as a slicer or a member of a hierarchy.
    Point out that the levels of the hierarchy are all stored within a single dimension table, resulting in duplication. This is preferable to normalizing the data to create a table for each hierarchy in a snowflake schema. OLTP database developers might find this preference for duplication over normalization unintuitive. However, you should remind them that dimension data is generally denormalized from multiple tables before being loaded, and does not experience the same level of transactional updates as would occur in an OLTP database. Therefore, the performance benefits of storing the data in a single table generally outweigh the reduced duplication benefits of normalizing the data.
  • In the example on the slide, the dimension data is not normalized in the source system, so there is no business key. Instead, the value itself is used as the alternate key for rows where the value is not “Unknown” or “None”.
    In this example, ask students to consider a scenario where a foreign key relationship in the source data references a lookup table for the DiscountType column. The foreign key table includes a numeric primary key column with the values, 0, 1, 2, and 3 for the discount types “N/A,” “Bulk Discount,” “Promotion,” and “Other,” respectively. How would this affect the design of the dimension table?
    One answer is that the primary key values in the source system would serve as the alternate key values in the dimension table; an unused value (such as -1) would be used for the “Unknown” row. The source foreign key and dimension alternate key could then be compared using a similar ISNULL expression, as shown in the student notes (except that the value -1 would be used instead of “Unknown”).
    As an additional consideration, what if the lookup table in the source system did not include the 0 value for “N/A,” and both “None” and “Unknown” are indicated by a NULL in the source table? If users want to differentiate between “None” and “Unknown” in reports or analyses, you could implement some logic in the ETL load process to match NULL values to the dimension row for “Unknown” when the Discount value is non-zero, and to “N/A” when the discount is 0.00. However, unless this level of differentiation has business value, it would be easier to include only a single row for “None or Unknown” in the dimension table.
  • Explain that Type 2 slowly changing dimensions can be implemented by using a current flag, start and end date stamps, or both. Point out the following considerations:
    If only a current flag is used, there is no way to match a new fact row to a dimension row, based on the time that the fact was recorded. The ETL process must assume that the current version of the dimension entity should be used.
    If only a start date is used, identifying the current row requires that you find the row with the most recent start date, typically by using the MAX function.
    Using an end date column makes it easier to find the right version of a dimension entity for a fact, based on the point in time when the fact event occurred, as sketched below. Without the end date column, the appropriate dimension table row can only be determined by finding the most recent start date occurring before the date of the fact event.
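A sketch of that point-in-time lookup, using the slide's customer table (variable values are illustrative): with both start and end dates stored, the row in effect when the fact occurred falls out of a simple range predicate.

```sql
DECLARE @CustAltKey NVARCHAR(10) = N'1002',
        @FactDate   DATE         = '2011-06-15';

-- Find the dimension version that was in effect on the fact date.
SELECT c.CustKey
FROM   dbo.DimCustomer AS c
WHERE  c.CustAltKey = @CustAltKey
  AND  c.[Start] <= @FactDate
  AND (c.[End] > @FactDate OR c.[End] IS NULL);  -- open-ended row is current
```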
  • Discuss the issues in the bulleted list in the student content. In some cases, you might choose to include a column for the parent alternate key in addition to the parent key, because this can be useful in some load techniques.
    Some techniques for loading self-referencing dimension tables are discussed in Module 6 of this course: Creating an ETL Solution.
  • As an alternative to a junk dimension, fact-specific attributes can be used to create degenerate dimensions in the fact table. This approach is discussed in the next lesson.
    Categorize Activity
    Place each item into the appropriate dimension-type category. Indicate your answer by writing the category number to the right of each item.
    Slicer
    (1) Gender
    Time Dimension
    (1) Fiscal Year
    (2) Month
    Junk Dimension
    (1) Invoice Number
    (2) Out of Stock Indicator
  • Point out that degenerate dimension columns provide the same capability as a junk dimension table. In a scenario where only one fact table requires the additional miscellaneous attributes for analysis and reporting, it is generally more efficient to include them as degenerate dimension columns. Conversely, if the additional attributes are relevant for multiple fact tables, a junk dimension is probably a better choice.
    Discuss the note about fact table primary keys in the student manual. Students with a strong background in relational database design might feel uncomfortable about not defining a primary key for every table. If, however, there is no need to uniquely identify individual fact rows, and the ETL process can be relied on to eliminate accidental duplicate entries, defining a primary key adds unnecessary overhead to the table definition and generates an index, which can negatively affect the performance of data loads.
    Similarly, note that declaring foreign key constraints on dimension-key columns in a fact table is not necessary to enforce referential integrity in most data warehouses, and can negatively impact load performance. You can declare them, and then drop and recreate them during each load, but this creates its own overhead and adds little value if the ETL process is correctly implemented. The query optimizer can use foreign key constraints to identify the fact table in a star join query but, in their absence, selects the largest table, which is usually correct.
  • Use the examples on the slide and in the student notes to show how summing semi-additive and non-additive measures produces meaningless results.
  • Discuss the importance of including a row for “Unknown” or “None” in the time dimension table when using accumulating snapshot fact tables.
    Point out that accumulating snapshot fact tables must be updated after the initial load. This requirement can affect the physical design of the table, especially if partitions or column store indexes are used. These considerations are discussed in the next lesson and in Module 6: Creating an ETL Solution.
    Question
    What kind of measure is a stock count?
    ( )Option 1: Additive
    ( )Option 2: Semi-additive
    ( )Option 3: Non-additive
    Answer
    (√) Option 2: Semi-additive
  • Point out that query performance in a data warehouse is degraded by fragmentation, and having many indexes on tables in a data warehouse can lead to significant fragmentation after each ETL data load. Additionally, the loads take longer because index keys must be stored on the correct page. To reduce the fragmentation caused by a data load, you can use a fill factor to leave space for inserts. Periodically, however, you will still need to either reorganize or rebuild indexes, which will affect performance and might require some indexes to be taken offline. Alternatively, you can drop all indexes before each load and recreate them afterwards, an action that improves load performance, but incurs the overhead of recreating the indexes and, depending on the volume of data loaded, may be very time-consuming.
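A sketch of the drop-and-recreate alternative mentioned above, using the hypothetical index and table names from the earlier sketches: nonclustered indexes are disabled before the load and rebuilt afterwards, and a fill factor leaves free space to reduce fragmentation from subsequent inserts.

```sql
-- Disable nonclustered indexes before a large load.
ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales DISABLE;

-- ... bulk load into dbo.FactSales runs here ...

-- Rebuild afterwards, leaving 10% free space per page.
ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales REBUILD
    WITH (FILLFACTOR = 90);
```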
  • Point out the note in the student workbook, and discuss the considerations for propagating deletions in source databases to the data warehouse. Emphasize that, in most scenarios, data warehouses are used to store historical data, so it is common for deletions to not be propagated. In some cases, logical deletions are performed in the data warehouse by setting a “deleted” flag column.
