Data Warehouse Design and Best Practices

A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced and flexible architecture that is capable of meeting both present and long-term future needs. This session compares the main data warehouse architectures and covers best practices for the logical and physical design that support staging, load and querying.

Published in: Data & Analytics

Data Warehouse Design and Best Practices

  1. Data Warehouse Design: Best Practices
  2. About me
     - Project Manager @
     - 12 years professional experience
     - .NET Web Development MCPD
     - SQL Server 2012 (MCSA)
     - Business interests
       - Web Development, SOA, Integration
       - Security, Performance Optimization
       - Horizon2020, Open BIM, GIS, Mapping
     - Contact me
       - ivelin.andreev@icb.bg
       - www.linkedin.com/in/ivelin
       - www.slideshare.net/ivoandreev
  3. About me
     - Senior Developer @
     - .NET Web Development MCPD
     - Business interests
       - Web Development, WCF, Integration
       - SQL Server: Query Optimization and Tuning
       - Data Warehousing
     - Contact me
       - georgi.mishev@icb.bg
       - www.linkedin.com/in/georgimishev
  4. Sponsors
  5. Agenda
     - Why Data Warehouse
     - Main DW Architectures
     - Dimensional Modeling
     - Patterns & Practices
     - DW Maintenance
     - ETL Process
     - SSIS Demo
  6. Lots of Data Everywhere
     - Can't find data?
       - Data scattered over the network
     - Can't get data?
       - Need an expert to get the data
     - Can't understand data?
       - Data poorly documented
     - Can't use data found?
       - Data needs to be transformed
  7. Data Warehouse?
     Def: A central repository where data are organized, cleansed and stored in a standardized format.
     - Integrated
       - Heterogeneous sources
       - Data cleansing and conversion ($, €, 元)
     - Focused on a subject
       - e.g. Customer, Sale, Product
     - Time variant
       - Timestamp every key
     - Historical data (10+ years)
  8. Different Problems - Different Solutions

     |               | OLTP Database              | Data Warehouse         |
     |---------------|----------------------------|------------------------|
     | Users         | Customer                   | Knowledge worker       |
     | Design        | Normalized, data integrity | Denormalized           |
     | Function      | Daily operation            | Decision making        |
     | Data          | Current, detailed          | Historical, aggregated |
     | Usage         | Real time                  | Ad-hoc                 |
     | Access        | Short R/W transactions     | Complex R/O queries    |
     | Data accessed | Comparatively lower        | Large amounts          |
     | # Records     | x100                       | x1'000'000             |
     | # Users       | x1'000                     | x10                    |
     | DB size       | x10 GB                     | x100 GB - TB           |
  9. Different DW Architectures
  10. B. Inmon Model: Top-Down Approach
      - Warehouse (3NF)
      - Data Mart / OLAP (MD)
      - Diagram source: http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640&h=368
  11. R. Kimball Model: Bottom-Up Approach
      - Data Marts (3NF or MD)
      - Warehouse / OLAP (MD)
      - Diagram source: http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640&h=369
  12. Data Vault (by Dan Linstedt)
      - Hubs
        - List of unique business keys
      - Links
        - Unique relationships between keys
      - Satellites
        - Hub and Link details and history
  13. It is irrelevant which camp you belong to… as long as you understand why!
  14. Making Your Choice
      - Kimball (MD)
        - (+) Start small, scale big
        - (+) Faster ROI
        - (+) Analytical tools
        - (-) Low reusability
      - Data Vault
      - Inmon (3NF)
        - (+) Structured
        - (+) Easy to maintain
        - (+) Easier data mining
        - (-) Time-consuming to build
      - Backend Data Warehouse
        - (+) Multiple sources; full history; incremental build
        - (-) Up-front work; long-term payoff; many joins
  15. Dimensional modeling as the de facto standard
  16. Dimensions
      Def: The object of BI interest
      - Keys
        - Surrogate key
        - Business key
      - Hierarchical attributes
        - Analysis and drill down
      - Member properties
        - Presentation labels
      - Auditing information (not for end users)
  17. Slowly Changing Dimensions
      Def: Scheme for recording changes over time
      - Type 1 - Overwrite
      - Type 2 - Multiple records
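The difference between the two types on slide 17 can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the `CustomerRow` layout, the `valid_from` / `valid_to` columns and the function names are assumptions introduced here.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    # Hypothetical dimension row layout, used only for this sketch.
    surrogate_key: int
    business_key: str
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_type1(rows: List[CustomerRow], business_key: str, new_city: str) -> List[CustomerRow]:
    """Type 1: overwrite the attribute in place; the old value is lost."""
    return [replace(r, city=new_city) if r.business_key == business_key else r
            for r in rows]


def apply_type2(rows: List[CustomerRow], business_key: str, new_city: str,
                change_date: date) -> List[CustomerRow]:
    """Type 2: close the current row and append a new version with a new surrogate key."""
    result = []
    for r in rows:
        if r.business_key == business_key and r.valid_to is None:
            r = replace(r, valid_to=change_date)  # expire the current version
        result.append(r)
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    result.append(CustomerRow(next_key, business_key, new_city, change_date))
    return result


dim = [CustomerRow(1, "C-100", "Sofia", date(2010, 1, 1))]
type1 = apply_type1(dim, "C-100", "Plovdiv")
type2 = apply_type2(dim, "C-100", "Plovdiv", date(2014, 6, 1))
```

With Type 1 the dimension still holds one row per customer; with Type 2 it holds one row per version, and facts loaded after the change reference the new surrogate key.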
  18. Facts
      Def: Measurement of a business process
      - Keys
        - FK from all dimension tables (in the star)
        - PK: composite (usually) or surrogate
      - Measures
        - Numeric columns that are of interest to the business
        - Additive, non-additive, semi-additive
      - Factless facts
      - Auditing information (optional)
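The distinction between additive and semi-additive measures is easiest to see with a balance snapshot. The sketch below uses invented data and hypothetical function names; it assumes a daily account-balance fact, which is a common textbook example rather than something taken from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical daily snapshot fact: (date key, account, balance).
snapshots: List[Tuple[int, str, int]] = [
    (20140101, "A", 100), (20140101, "B", 50),
    (20140102, "A", 120), (20140102, "B", 40),
]


def total_balance_per_day(rows: List[Tuple[int, str, int]]) -> Dict[int, int]:
    """Balances may be summed across accounts within one day."""
    totals: Dict[int, int] = defaultdict(int)
    for day, _account, balance in rows:
        totals[day] += balance
    return dict(totals)


def closing_balance(rows: List[Tuple[int, str, int]]) -> int:
    """Across time a balance is not summed; take the latest day's total instead."""
    per_day = total_balance_per_day(rows)
    return per_day[max(per_day)]


per_day = total_balance_per_day(snapshots)
closing = closing_balance(snapshots)
naive_sum = sum(balance for _day, _account, balance in snapshots)  # mixes days, not meaningful
```

A sales amount, by contrast, is fully additive: summing it across every dimension, including time, gives a meaningful total.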
  19. Practices and Design Patterns
  20. Data Warehouse Pitfalls
      - Admit it is not as simple as it seems
        - You need education
      - Find what is of business value
        - Rather than focusing on performance
      - Expect to spend a lot of time on Extract-Transform-Load
        - Homogenize data from different sources
        - Find (and resolve) problems in source systems
  21. Prepare your Sources
      - Data integrity
        - Avoid redundancy
      - Data quality
        - Master data source
        - Data validation
      - Auditing
        - CreatedDate / CreatedBy
        - ChangedDate / ChangedBy
      - Nightly jobs
  22. Dimension Design
      - Business key with non-clustered index
        - Include date (if dimension has history)
      - Surrogate key
        - The smallest possible integer
        - Clustered index
      - FK constraints
        - Do not enforce (WITH NOCHECK)
        - Document the relation
        - Faster load
      - Data validation
        - Task for the source system
  23. Conformed Dimensions
      Def: Having the same meaning and content when referenced from multiple fact tables
      - Date dimension
        - Best candidate for partitioning
      - Granularity
        - Do not store every hour when reporting daily
      - Avoid surrogate keys
        - Saves lookups and joins
        - Use an integer representing the date (yyyyMMdd, or days after 1/1/1900)
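The integer date key from slide 23 can be derived directly from the date, so no lookup against the dimension is needed at load time. A minimal sketch follows; the column names in `build_date_dimension` are assumptions chosen for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List


def date_key(d: date) -> int:
    """Encode a date as an integer in yyyyMMdd form, e.g. 2014-06-01 becomes 20140601."""
    return d.year * 10000 + d.month * 100 + d.day


def key_to_date(key: int) -> date:
    """Decode a yyyyMMdd integer back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


def build_date_dimension(start: date, end: date) -> List[Dict[str, int]]:
    """Generate one row per day (daily grain) between start and end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "DateKey": date_key(current),       # hypothetical column names
            "Year": current.year,
            "Month": current.month,
            "Day": current.day,
            "Weekday": current.isoweekday(),    # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2014, 1, 1), date(2014, 12, 31))
```

Because the key is human-readable and sorts chronologically, it also works naturally as a partitioning column for date-partitioned fact tables.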
  24. Pre-join Hierarchies
      - Recursive relationships
      - Fast drill and report
      - Pre-computed aggregations
      - Hierarchy bridge
        - For each dimension row:
          - 1 association with self
          - 1 row for each subordinate
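The bridge described on slide 24 can be generated from a parent-child (recursive) relationship. The sketch below is one possible way to do it, assuming the hierarchy is given as a hypothetical `parent_of` mapping with no cycles; the `depth` column is an addition of this example.

```python
from typing import Dict, List, Optional, Tuple


def build_hierarchy_bridge(parent_of: Dict[int, Optional[int]]) -> List[Tuple[int, int, int]]:
    """Return (ancestor_key, descendant_key, depth) rows.

    Every member gets one row pointing at itself (depth 0) plus one row
    for each ancestor above it. Assumes the hierarchy contains no cycles.
    """
    bridge = []
    for member in parent_of:
        ancestor: Optional[int] = member
        depth = 0
        while ancestor is not None:
            bridge.append((ancestor, member, depth))
            ancestor = parent_of[ancestor]  # walk one level up
            depth += 1
    return sorted(bridge)


# Hypothetical employee hierarchy: 1 manages 2 and 3, and 2 manages 4.
employees = {1: None, 2: 1, 3: 1, 4: 2}
bridge = build_hierarchy_bridge(employees)
subordinates_of_2 = [descendant for ancestor, descendant, _ in bridge if ancestor == 2]
```

Joining a fact table to the bridge on the descendant key and filtering on the ancestor key rolls up a whole subtree with a plain join, without a recursive query at report time.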
  25. Determine the Facts
      The center of a star schema
      - Identify subject areas
      - Identify key business events
      - Identify dimensions
        - Start from the OLTP logical model
      - Identify historical requirements
      - Identify attributes
  26. The Grain
      Def: The level of detail of a fact table
      - What is the business objective?
        - Fine grain: behaviour and frequency analysis
        - Coarse grain: overall and trend analysis
      - Aggregates
        - DO NOT summarize prematurely
        - DO NOT mix detail and summary
        - DO use "summary tables"
  27. C-3PO is fluent in 6M forms of communication. What about your customers?
  28. Multinational DW
      - What parts need translation?
      - Where to store various language versions?
      - How to support future languages?
      - Dimensions
        - Add language attribute
        - Include text data in the dimension
      - Problem 1: The dimension key?
        - Replicate PK for every language
        - Fact.DimId = Dim.Id AND Dim.Lang=[Lang]
      - Problem 2: Storage = [Dim] x [Lang]
        - Sub-dimension with language attributes

      | TxtId | Attr1 | Attr2 | LangId |
      |-------|-------|-------|--------|
      | 1     | large | Yes   | En     |
      | 2     | small | No    | En     |
      | 1     | stor  | Ja    | No     |
      | 2     | liten | Nei   | No     |
      | 3     | …     | …     | …      |
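The join condition from slide 28 (`Fact.DimId = Dim.Id AND Dim.Lang=[Lang]`) can be tried out with an in-memory SQLite database. This is only a sketch: the `DimSize` and `FactSale` tables and their contents are invented for illustration, and SQLite stands in for SQL Server.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The dimension key is replicated for every language: PK = (Id, Lang).
    CREATE TABLE DimSize (Id INTEGER, Lang TEXT, Label TEXT, PRIMARY KEY (Id, Lang));
    CREATE TABLE FactSale (DimId INTEGER, Amount REAL);

    INSERT INTO DimSize VALUES (1, 'En', 'large'), (2, 'En', 'small'),
                               (1, 'No', 'stor'),  (2, 'No', 'liten');
    INSERT INTO FactSale VALUES (1, 10.0), (2, 5.0), (1, 2.5);
""")


def sales_by_label(lang: str) -> List[Tuple[str, float]]:
    """Aggregate sales per dimension label, shown in the requested language."""
    cursor = conn.execute(
        """
        SELECT d.Label, SUM(f.Amount)
        FROM FactSale AS f
        JOIN DimSize AS d ON f.DimId = d.Id AND d.Lang = ?
        GROUP BY d.Label
        ORDER BY d.Label
        """,
        (lang,),
    )
    return cursor.fetchall()


english = sales_by_label("En")
norwegian = sales_by_label("No")
```

The fact rows are stored once; only the language filter in the join decides which labels the report shows, so adding a language means adding dimension rows rather than changing the fact table.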
  29. Data warehouse maintenance
  30. How Large is "Large"? Is big really big?
  31. Partitioning
      - Why
        - Faster index maintenance
        - Faster load
        - Faster queries
      - When
        - Tables 10GB+
      - How
        - Do not partition dimension tables
        - Partition by date (most analyses are time-based)
        - Eliminate partitions (WHERE [PartitionKey]=…)
        - Avoid split and merge of existing partitions
          - Can cause inefficient log generation
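Why partition elimination speeds up time-based queries can be shown with a toy model. The class below is not how SQL Server implements partitioning; it is a simplified in-memory stand-in, with hypothetical names, that keeps rows in monthly buckets keyed by yyyyMM.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PartitionedFactTable:
    """Toy fact table split into monthly partitions keyed by yyyyMM."""

    def __init__(self) -> None:
        self.partitions: Dict[int, List[Tuple[int, float]]] = defaultdict(list)

    def insert(self, date_key: int, amount: float) -> None:
        # date_key is an integer in yyyyMMdd form; dividing by 100 yields yyyyMM.
        self.partitions[date_key // 100].append((date_key, amount))

    def monthly_total(self, month_key: int) -> Tuple[float, int]:
        """Return (total amount, rows scanned), reading only the matching partition."""
        rows = self.partitions.get(month_key, [])
        return sum(amount for _, amount in rows), len(rows)

    def row_count(self) -> int:
        return sum(len(rows) for rows in self.partitions.values())


fact = PartitionedFactTable()
for key, amount in [(20140115, 10.0), (20140120, 5.0), (20140203, 7.5), (20140310, 1.0)]:
    fact.insert(key, amount)

total, scanned = fact.monthly_total(201401)
```

The query for January touches two of the four stored rows, because the filter on the partition key rules out the other partitions before any row is read.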
  32. Columnstore Index
      - Non-clustered in SQL Server 2012
      - Clustered in SQL Server 2014
      - Pros
        - Better data compression
        - High performance on table scan
      - Clustered CSI limitations
        - No other indexes allowed
        - Little advantage on seek operations
        - No XML, computed column or replication
  33. Extract-Transform-Load
      - Extract data from OLTP
      - Data transformations
      - Data loads
      - DW maintenance
  34. Efficient Load Process
      - Use the simple recovery model during data load
      - Staging
        - Avoid indexing
        - Populate in parallel
      - Maintain DW
        - Disable indexes on load
        - Rebuild manually after load
        - Automatic statistics updates slow down SQL Server
  35. To SSIS, or not to SSIS?
      - Pros
        - Little to no coding
        - Extensive support for various data sources
        - Parallel execution of migration tasks
        - Better organization of the ETL process
      - Cons
        - Another way of thinking
        - Hidden options
        - A T-SQL developer would do it much faster
        - Auto-generated flows need optimization
        - Sometimes simply does not work (e.g. Sort by GUID)
  36. Takeaways
      - Books
        - The Data Warehouse Toolkit (3rd ed), Ralph Kimball
        - Implementing DW with Microsoft SQL Server 2012
        - Data Warehousing Fundamentals, Paulraj Ponniah
      - Articles
        - Best Practices in Data Warehouse (Hanover Research Council)
        - http://www.kimballgroup.com/category/design-tips/
        - http://sqlmag.com/business-intelligence
      - Resources
        - http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
        - http://www.databaseanswers.org/data_models/index.htm
