BG SQL&BI User group meeting - DWH modeling - the basics

Published in: Technology

  1. Data warehouse modeling – the basics. SQL&BI UG Meeting, Oct 2019
  2. What Is a Data Warehouse?
     A centralized store of business data for reporting and analysis that typically:
     • Contains large volumes of historical data
     • Is optimized for querying, as opposed to inserting or updating data
     • Is incrementally loaded with new business data at regular intervals
     • Provides the basis for enterprise BI solutions
  3. Data Warehouse Architectures
     • Central Data Warehouse
     • Departmental Data Marts
     • Hub-and-Spoke
  4. Components of a Data Warehousing Solution
     • Data Sources
     • ETL and Data Cleansing
     • Data Warehouse
     • Data Models
     • Reporting and Analysis
     • MDM
  5. The dimensional model
  6. The Dimensional Model
     [Diagram: star schema and snowflake schema, each showing a Fact table (Measures) linked to Dimension tables (Attributes)]
  7. Star Schema
  8. Star schema in Power BI models
  9. The Data Warehouse Design Process
     1. Determine analytical and reporting requirements
     2. Identify the business processes that generate the required data
     3. Examine the source data for those business processes
     4. Conform dimensions across business processes
     5. Prioritize processes and create a dimensional model for each
     6. Document and refine the models to determine the database logical schema
     7. Design the physical data structures for the database
  10. Dimensional Modeling
      Dimensions: Time, Product, Customer, Salesperson, FactoryLine, Shipper, Account, Department, Warehouse
      Business processes (x = the process uses a dimension):
        Manufacturing         x x x
        Order Processing      x x x x
        Order Fulfillment     x x x
        Financial Accounting  x x x
        Inventory Management  x x x
      • Grain: 1 row per order item
      • Dimensions: Time (order date and ship date), Product, Customer, Salesperson
      • Facts: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
  11. Documenting Dimensional Models
      Fact: Sales Order
        Measures: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
      Dimensions:
        • Time (Order Date and Ship Date): Calendar Year, Month, Date; Fiscal Year, Fiscal Quarter, Month, Date
        • Customer: Country, State or Province, City, Age, Marital Status, Gender
        • Product: Category, Subcategory, Product Name, Color, Size
        • Salesperson: Region, Country, Territory, Manager, Surname, Forename
  12. Lesson 2: Designing Dimension Tables
      • Considerations for Dimension Keys
      • Dimension Attributes and Hierarchies
      • Unknown and None
      • Designing Slowly Changing Dimensions
      • Time Dimension Tables
      • Self-Referencing Dimension Tables
      • Junk Dimensions
  13. Considerations for Dimension Keys
      • Surrogate key: ProductKey, CustomerKey
      • Business (alternate) key: ProductAltKey, CustomerAltKey

      ProductKey  ProductAltKey  ProductName        Color  Size
      1           MB1-B-32       MB1 Mountain Bike  Blue   32
      2           MB1-R-32       MB1 Mountain Bike  Red    32

      CustomerKey  CustomerAltKey  Name
      1            1002            Amy Alberts
      2            1005            Neil Black
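As a rough illustration of the surrogate key idea on this slide, the sketch below assigns small integer keys to rows identified by a business (alternate) key. The function name `assign_surrogate_keys` and the in-memory lookup are assumptions made for the example; a real warehouse would normally let the database generate the key.

```python
from itertools import count

def assign_surrogate_keys(rows, alt_key_field, key_field):
    """Give each distinct business (alternate) key a small integer surrogate key."""
    next_key = count(1)   # surrogate keys start at 1 and carry no business meaning
    lookup = {}           # business key -> surrogate key
    keyed_rows = []
    for row in rows:
        alt_key = row[alt_key_field]
        if alt_key not in lookup:
            lookup[alt_key] = next(next_key)
        keyed_rows.append({key_field: lookup[alt_key], **row})
    return keyed_rows

# Product rows from the slide, identified only by their business key.
products = assign_surrogate_keys(
    [
        {"ProductAltKey": "MB1-B-32", "ProductName": "MB1 Mountain Bike", "Color": "Blue", "Size": 32},
        {"ProductAltKey": "MB1-R-32", "ProductName": "MB1 Mountain Bike", "Color": "Red", "Size": 32},
    ],
    alt_key_field="ProductAltKey",
    key_field="ProductKey",
)
```

Fact rows would then reference `ProductKey`, so a change to the source system's key format does not ripple into the fact table.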
  14. Dimension Attributes and Hierarchies
      Attribute roles shown: Hierarchy, Slicer, Drill-through detail

      CustKey  CustAltKey  Name         Country  State  City       Phone    Gender
      1        1002        Amy Alberts  Canada   BC     Vancouver  555 123  F
      2        1005        Neil Black   USA      CA     Irvine     555 321  M
      3        1006        Ye Xu        USA      NY     New York   555 222  M
  15. Unknown and None
      • Identify the semantic meaning of NULL
      • Unknown or None?
      • Do not assume NULL equality
      • Use ISNULL( )

      Source
      OrderNo  Discount  DiscountType
      1000     1.20      Bulk Discount
      1001     0.00      N/A
      1002     2.00
      1003     0.50      Promotion
      1004     2.50      Other
      1005     0.00      N/A
      1006     1.50

      Dimension Table
      DiscKey  DiscAltKey     DiscountType
      -1       Unknown        Unknown
      0        N/A            None
      1        Bulk Discount  Bulk Discount
      2        Promotion      Promotion
      3        Other          Other
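The mapping from source values to the Unknown and None dimension rows can be sketched as a lookup that treats a missing `DiscountType` as Unknown (-1) and an explicit `N/A` as None (0). The helper `discount_key` is hypothetical, and sending unrecognized values to Unknown is an assumption of this sketch rather than something the slide states.

```python
# Dimension rows from the slide: DiscAltKey -> DiscKey
DISCOUNT_KEYS = {"Unknown": -1, "N/A": 0, "Bulk Discount": 1, "Promotion": 2, "Other": 3}

def discount_key(discount_type):
    """Map a source DiscountType to a dimension key; missing values become Unknown (-1)."""
    if discount_type is None or discount_type.strip() == "":
        return DISCOUNT_KEYS["Unknown"]
    # Assumption: a value with no matching dimension row is also treated as Unknown.
    return DISCOUNT_KEYS.get(discount_type, DISCOUNT_KEYS["Unknown"])

# Source rows from the slide: (OrderNo, Discount, DiscountType); None marks a NULL.
source = [
    (1000, 1.20, "Bulk Discount"),
    (1001, 0.00, "N/A"),
    (1002, 2.00, None),
    (1003, 0.50, "Promotion"),
    (1004, 2.50, "Other"),
    (1005, 0.00, "N/A"),
    (1006, 1.50, None),
]

keyed = [(order_no, discount_key(discount_type)) for order_no, _, discount_type in source]
```

Orders 1002 and 1006 carry a discount but no type, so they land on the Unknown row, while 1001 and 1005 land on the None row because the source explicitly says no discount applies.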
  16. Designing Slowly Changing Dimensions
      Type 1 (value is overwritten)
        Before:
        CustKey  CustAltKey  Name         Phone
        1        1002        Amy Alberts  555 123
        After:
        CustKey  CustAltKey  Name         Phone
        1        1002        Amy Alberts  555 222

      Type 2 (new row is added, old row is closed)
        Before:
        CustKey  CustAltKey  Name         City       Current  Start     End
        1        1002        Amy Alberts  Vancouver  Yes      1/1/2000
        After:
        CustKey  CustAltKey  Name         City       Current  Start     End
        1        1002        Amy Alberts  Vancouver  No       1/1/2000  1/1/2012
        4        1002        Amy Alberts  Toronto    Yes      1/1/2012

      Type 3 (prior value is kept in a separate column)
        Before:
        CustKey  CustAltKey  Name         Cars
        1        1002        Amy Alberts  0
        After:
        CustKey  CustAltKey  Name         Prior Cars  Current Cars
        1        1002        Amy Alberts  0           1
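The Type 2 case is the one that most benefits from a worked example. The sketch below is one possible in-memory version, assuming a hypothetical helper `apply_type2_change` and assuming the next surrogate key is simply the current maximum plus one; the start dates for the two extra customers are made up for the illustration.

```python
from datetime import date

def apply_type2_change(dimension, alt_key, changes, effective):
    """Type 2 change: expire the current row and append a new current row."""
    current = next(r for r in dimension
                   if r["CustAltKey"] == alt_key and r["Current"])
    current["Current"] = False
    current["End"] = effective
    new_row = {
        **current,
        **changes,
        "CustKey": max(r["CustKey"] for r in dimension) + 1,  # next surrogate key (assumption)
        "Current": True,
        "Start": effective,
        "End": None,
    }
    dimension.append(new_row)
    return new_row

customers = [
    {"CustKey": 1, "CustAltKey": 1002, "Name": "Amy Alberts", "City": "Vancouver",
     "Current": True, "Start": date(2000, 1, 1), "End": None},
    # Start dates below are illustrative assumptions, not taken from the slides.
    {"CustKey": 2, "CustAltKey": 1005, "Name": "Neil Black", "City": "Irvine",
     "Current": True, "Start": date(2000, 1, 1), "End": None},
    {"CustKey": 3, "CustAltKey": 1006, "Name": "Ye Xu", "City": "New York",
     "Current": True, "Start": date(2000, 1, 1), "End": None},
]

new_row = apply_type2_change(customers, 1002, {"City": "Toronto"}, date(2012, 1, 1))
```

Because the business key 1002 now maps to two surrogate keys, facts loaded before the move keep pointing at the Vancouver row and later facts point at the Toronto row.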
  17. Time Dimension Tables
      • Surrogate key
      • Granularity
      • Range
      • Attributes and hierarchies
      • Multiple calendars
      • Unknown values

      DateKey   DateAltKey  MonthDay  WeekDay  Day   MonthNo  Month  Year
      00000000  01-01-1753  NULL      NULL     NULL  NULL     NULL   NULL
      20130101  01-01-2013  1         3        Tue   01       Jan    2013
      20130102  01-02-2013  2         4        Wed   01       Jan    2013
      20130103  01-03-2013  3         5        Thu   01       Jan    2013
      20130104  01-04-2013  4         6        Fri   01       Jan    2013
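A time dimension is usually generated rather than loaded from a source. The following is a minimal sketch, assuming a `yyyymmdd` integer surrogate key and a weekday numbering where 1 is Sunday (which matches the Tue = 3 shown on the slide); the function name `date_dimension` is hypothetical.

```python
from datetime import date, timedelta

DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def date_dimension(start, end):
    """Build one dimension row per calendar day from start to end (inclusive)."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "DateKey": int(day.strftime("%Y%m%d")),  # e.g. 20130101
            "DateAltKey": day,
            "MonthDay": day.day,
            "WeekDay": day.isoweekday() % 7 + 1,     # 1 = Sunday ... 7 = Saturday
            "Day": DAY_NAMES[day.weekday()],
            "MonthNo": day.month,
            "Month": MONTH_NAMES[day.month - 1],
            "Year": day.year,
        })
        day += timedelta(days=1)
    return rows

rows = date_dimension(date(2013, 1, 1), date(2013, 1, 4))
```

Additional calendars (fiscal year, fiscal quarter) would be extra columns computed in the same loop.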
  18. Self-Referencing Dimension Tables

      EmployeeKey  EmployeeAltKey  EmployeeName     ManagerKey
      1            1000            Kim Abercrombie  NULL
      2            1001            Kamil Amireh     1
      3            1002            Cesar Garcia     1
      4            1003            Jeff Hay         2

      Resulting hierarchy:
        Kim Abercrombie (1st Level Manager)
          Kamil Amireh (2nd Level Manager)
            Jeff Hay (3rd Level, Employee)
          Cesar Garcia (2nd Level Manager)
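The levels in this hierarchy follow directly from walking the `ManagerKey` references up to the root. A small sketch, with the hypothetical helper `hierarchy_level`, assuming the table contains no cycles:

```python
employees = [
    {"EmployeeKey": 1, "EmployeeName": "Kim Abercrombie", "ManagerKey": None},
    {"EmployeeKey": 2, "EmployeeName": "Kamil Amireh", "ManagerKey": 1},
    {"EmployeeKey": 3, "EmployeeName": "Cesar Garcia", "ManagerKey": 1},
    {"EmployeeKey": 4, "EmployeeName": "Jeff Hay", "ManagerKey": 2},
]
by_key = {e["EmployeeKey"]: e for e in employees}

def hierarchy_level(employee_key):
    """Count how many rows sit between this employee and the root (root = level 1)."""
    level = 1
    manager_key = by_key[employee_key]["ManagerKey"]
    while manager_key is not None:   # assumes the ManagerKey chain has no cycles
        level += 1
        manager_key = by_key[manager_key]["ManagerKey"]
    return level

levels = {e["EmployeeName"]: hierarchy_level(e["EmployeeKey"]) for e in employees}
```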
  19. Junk Dimensions
      • Combine low-cardinality attributes that don’t belong in existing dimensions into a junk dimension
      • Avoids creating many small dimension tables

      JunkKey  OutOfStockFlag  FreeShippingFlag  CreditOrDebit
      1        1               1                 Credit
      2        1               1                 Debit
      3        1               0                 Credit
      4        1               0                 Debit
      5        0               1                 Credit
      6        0               1                 Debit
      7        0               0                 Credit
      8        0               0                 Debit
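The junk dimension on this slide is the Cartesian product of the possible values of each attribute, so it can be pre-populated in one pass. A minimal sketch using `itertools.product`, with the value order chosen to reproduce the key order shown above:

```python
from itertools import product

# Possible values per attribute, in the order used on the slide.
out_of_stock = (1, 0)
free_shipping = (1, 0)
credit_or_debit = ("Credit", "Debit")

junk_dimension = [
    {"JunkKey": key, "OutOfStockFlag": oos, "FreeShippingFlag": ship, "CreditOrDebit": kind}
    for key, (oos, ship, kind) in enumerate(
        product(out_of_stock, free_shipping, credit_or_debit), start=1
    )
]
```

With three two-valued attributes this yields 2 × 2 × 2 = 8 rows; each fact row then stores a single `JunkKey` instead of three separate flag columns or three tiny dimension keys.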
  20. Fact tables
      • Fact Table Columns
      • Types of Measure
      • Types of Fact Table
  21. Fact Table Keys
      • Dimension keys: OrderDateKey, ProductKey, CustomerKey
      • Measures: Qty, SalesAmount
      • Degenerate dimensions: OrderNo

      OrderDateKey  ProductKey  CustomerKey  OrderNo  Qty  SalesAmount
      20160101      25          120          1000     1    350.99
      20160101      99          120          1000     2    6.98
      20160101      25          178          1001     2    701.98
  22. Types of Measure
      • Additive (SalesAmount)
        OrderDateKey  ProductKey  CustomerKey  SalesAmount
        20160101      25          120          350.99
        20160101      99          120          6.98
        20160102      25          178          701.98

      • Semi-Additive (StockCount)
        DateKey   ProductKey  StockCount
        20160101  25          23
        20160101  99          118
        20160102  25          22

      • Non-Additive (ProfitMargin)
        OrderDateKey  ProductKey  CustomerKey  ProfitMargin
        20160101      25          120          25
        20160101      99          120          22
        20160102      25          178          27
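The difference between additive and semi-additive measures shows up as soon as the rows are aggregated. The sketch below uses the sample rows from this slide; `closing_stock` is a hypothetical helper, and taking the latest value is only one common way to roll a semi-additive measure up over time.

```python
from decimal import Decimal

sales = [  # (OrderDateKey, ProductKey, CustomerKey, SalesAmount)
    (20160101, 25, 120, Decimal("350.99")),
    (20160101, 99, 120, Decimal("6.98")),
    (20160102, 25, 178, Decimal("701.98")),
]
stock = [  # (DateKey, ProductKey, StockCount)
    (20160101, 25, 23),
    (20160101, 99, 118),
    (20160102, 25, 22),
]

# Additive: SalesAmount can be summed across every dimension, including time.
total_sales = sum(amount for *_, amount in sales)

# Semi-additive: StockCount can be summed across products for a single date...
stock_on_first_day = sum(c for d, _, c in stock if d == 20160101)

def closing_stock(product_key):
    """...but across dates the latest value is used instead of a sum."""
    rows = [(d, c) for d, p, c in stock if p == product_key]
    return max(rows)[1]   # count recorded on the latest date

stock_product_25 = closing_stock(25)
```

Summing the two stock rows for product 25 would give 45 units, which never existed; the latest value, 22, is the meaningful figure. A non-additive measure such as ProfitMargin should not be summed along any dimension.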
  23. Types of Fact Table
      • Transaction Fact Table
        OrderDateKey  ProductKey  CustomerKey  OrderNo  Qty  Cost    SalesAmount
        20160101      25          120          1000     1    125.00  350.99
        20160101      99          120          1000     2    2.50    6.98
        20160101      25          178          1001     2    250.00  701.98

      • Periodic Snapshot Fact Table
        DateKey   ProductKey  OpeningStock  UnitsIn  UnitsOut  ClosingStock
        20160101  25          25            1        3         23
        20160101  99          120           0        2         118

      • Accumulating Snapshot Fact Table
        OrderNo  OrderDateKey  ShipDateKey  DeliveryDateKey
        1000     20160101      20160102     20160105
        1001     20160101      20160102     00000000
        1002     20160102      00000000     00000000
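Unlike the other two types, an accumulating snapshot row is updated in place as the process moves through its milestones. A minimal sketch of that update, assuming the `00000000` placeholder is represented by the integer 0 and using a hypothetical helper `record_milestone`:

```python
UNKNOWN_DATE = 0  # stands in for the 00000000 placeholder date key

# Accumulating snapshot rows from the slide, keyed by OrderNo.
orders = {
    1000: {"OrderDateKey": 20160101, "ShipDateKey": 20160102, "DeliveryDateKey": 20160105},
    1001: {"OrderDateKey": 20160101, "ShipDateKey": 20160102, "DeliveryDateKey": UNKNOWN_DATE},
    1002: {"OrderDateKey": 20160102, "ShipDateKey": UNKNOWN_DATE, "DeliveryDateKey": UNKNOWN_DATE},
}

def record_milestone(order_no, milestone, date_key):
    """Fill in a milestone date on the existing row instead of inserting a new row."""
    row = orders[order_no]
    if row[milestone] != UNKNOWN_DATE:
        raise ValueError(f"{milestone} already recorded for order {order_no}")
    row[milestone] = date_key

# Order 1002 ships on 3 January 2016 (an illustrative date, not from the slides).
record_milestone(1002, "ShipDateKey", 20160103)
```

After the call, order 1002 has a ship date but still carries the placeholder for delivery, mirroring the state of order 1001 in the table above.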
  24. Physical implementation specifics
  25. Considerations for physical implementation
      • Data files and filegroups
      • Staging tables
      • tempdb
      • Transaction logs
      • Backup files
      • Partitioning
      • Indexing
  26. Considerations for Indexes
      • Dimension table indexes
        • Clustered index on surrogate key column
        • Nonclustered index on business key and SCD columns
        • Nonclustered indexes on frequently searched columns
      • Fact table indexes
        • Clustered index on most commonly searched date key
        • Nonclustered indexes on other dimension keys
        Or
        • Columnstore index on all columns
      • Composite index key comprises up to 16 columns
  27. Incremental load flows
  28. Options for Extracting Modified Data
      • Extract all records
      • Store a primary key and checksum
      • Use a datetime column as a “high water mark”
      • Use Change Data Capture
      • Use Change Tracking
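Two of these options, the stored checksum and the datetime high water mark, can be sketched without any database features. The row layout (`id`, `modified`) and the function names below are assumptions for the illustration; Change Data Capture and Change Tracking are engine features and are not modeled here.

```python
import hashlib
from datetime import datetime

def row_checksum(row):
    """Hash all non-key column values so a changed row produces a different checksum."""
    payload = "|".join(str(row[c]) for c in sorted(row) if c != "id")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_by_checksum(source_rows, stored_checksums):
    """Rows that are new, or whose checksum differs from the one stored at the last load."""
    return [r for r in source_rows if stored_checksums.get(r["id"]) != row_checksum(r)]

def changed_since(source_rows, high_water_mark):
    """Rows modified after the high water mark, plus the mark to store for the next run."""
    changed = [r for r in source_rows if r["modified"] > high_water_mark]
    new_mark = max((r["modified"] for r in changed), default=high_water_mark)
    return changed, new_mark

source_rows = [
    {"id": 1, "name": "Amy Alberts", "modified": datetime(2019, 10, 1, 9, 0)},
    {"id": 2, "name": "Neil Black", "modified": datetime(2019, 10, 3, 14, 30)},
]

# Checksums as they were at the previous load; row 2 has since been edited.
stored = {1: row_checksum(source_rows[0]), 2: "stale-checksum"}
by_checksum = changed_by_checksum(source_rows, stored)

by_mark, new_mark = changed_since(source_rows, datetime(2019, 10, 2))
```

The high water mark is cheaper because it only compares one column, but it relies on the source system maintaining that column reliably; the checksum approach works without such a column at the cost of reading every source row.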
  29. Common ETL Data Flow Architectures
      • Single-stage ETL (Source → DW):
        • Data is transferred directly from source to data warehouse
        • Transformations and validations occur in-flight or on extraction
      • Two-stage ETL (Source → Staging → DW):
        • Data is staged for a coordinated load
        • Transformations and validations occur in-flight, or on staged data
      • Three-stage ETL (Source → Landing Zone → Staging → DW):
        • Data is extracted quickly to a landing zone, and then staged prior to loading
        • Transformations and validation can occur throughout the data flow
