Building an Effective Data Warehouse Architecture
James Serra, Big Data Evangelist
Microsoft
May 7-9, 2014 | San Jose, CA

Building an Effective Data Warehouse Architecture

Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.

Published in: Technology, Business

  1. Building an Effective Data Warehouse Architecture
     James Serra, Big Data Evangelist, Microsoft
     May 7-9, 2014 | San Jose, CA
  2. Other Presentations
     • Building an Effective Data Warehouse Architecture
       Reasons for building a DW and the various approaches and DW concepts (Kimball vs Inmon)
     • Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud and MPP)
       Explains what Big Data is, its benefits including use cases, and how Hadoop, the cloud, and MPP fit in
     • Finding business value in Big Data (What exactly is Big Data and why should I care?)
       Very similar to “Building a Big Data Solution”, but the target audience is business users/CxO instead of architects
     • How does Microsoft solve Big Data?
       Covers the Microsoft products that can be used to create a Big Data solution
     • Modern Data Warehousing with the Microsoft Analytics Platform System
       The next step in data warehouse performance is APS, an MPP appliance
     • Power BI, Azure ML, Azure HDInsight, Azure Data Factory, etc.
       Deep dives into the various Microsoft Big Data related products
  3. About Me
     • Business Intelligence Consultant, in IT for 28 years
     • Microsoft, Big Data Evangelist
     • Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer
     • Been perm, contractor, consultant, business owner
     • Presenter at PASS Business Analytics Conference and PASS Summit
     • MCSE for SQL Server 2012: Data Platform and BI
     • Blog at JamesSerra.com
     • SQL Server MVP
     • Author of book “Reporting with Microsoft SQL Server 2012”
  4. I tried to build a data warehouse on my own… and ended up passed-out drunk in a Denny’s parking lot. Let’s prevent that from happening…
  5. Agenda
     • What a Data Warehouse is not
     • What is a Data Warehouse and why use one?
     • Fast Track Data Warehouse (FTDW)
     • Appliances
     • Data Warehouse vs Data Mart
     • Kimball and Inmon Methodologies
     • Populating a Data Warehouse
     • ETL vs ELT
     • Surrogate Keys
     • SSAS Cubes
     • Modern Data Warehouse
  6. What a Data Warehouse is not
     • A data warehouse is not a copy of a source database with the name prefixed with “DW”
     • It is not a copy of multiple tables (e.g. customer) from various source systems unioned together in a view
     • It is not a dumping ground for tables from various sources with little design put into it
  7. Data Warehouse Maturity Model
     Courtesy of Wayne Eckerson
  8. What is a Data Warehouse and why use one?
     A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". It is NOT to be used for OLTP applications.
     Reasons for a data warehouse:
     • Reduce stress on the production system
     • Optimized for read access and sequential disk scans
     • Integrate many sources of data
     • Keep historical records (no need to save hardcopy reports)
     • Restructure/rename tables and fields, model data
     • Protect against source system upgrades
     • Use Master Data Management, including hierarchies
     • No IT involvement needed for users to create reports
     • Improve data quality and plug holes in source systems
     • One version of the truth
     • Easy to create BI solutions on top of it (e.g. SSAS cubes)
  9. Why use a Data Warehouse?
     Legacy applications + databases = chaos
     [Diagram labels] Production Control, MRP, Inventory Control, Parts Management, Logistics, Shipping, Raw Goods, Order Control, Purchasing, Marketing, Finance, Sales, Accounting, Management Reporting, Engineering, Actuarial, Human Resources
     [Diagram labels] Continuity, Consolidation, Control, Compliance, Collaboration
     Enterprise data warehouse = order
     Single version of the truth: Enterprise Data Warehouse
     Every question = decision
     Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could not do before
  10. Hardware Solutions
      • Fast Track Data Warehouse: A reference configuration optimized for data warehousing. This saves an organization from having to commit resources to configure and build the server hardware. Fast Track Data Warehouse hardware is tested for data warehousing, which eliminates guesswork, and is designed to save you months of configuration, setup, testing and tuning. You just need to install the OS and SQL Server.
      • Appliances: Microsoft has made available SQL Server appliances (SMP and MPP) that allow customers to deploy data warehouse (DW), business intelligence (BI) and database consolidation solutions in a very short time, with all the components pre-configured and pre-optimized. These appliances include all the hardware, software and services for a complete, ready-to-run, out-of-the-box, high-performance, energy-efficient solution.
  11. Data Warehouse Fast Track for SQL Server 2014
      Hardware system design
      • Tight specifications for servers, storage, and networking
      • Resource balanced and validated
      • Latest-generation servers and storage, including solid-state disks (SSDs)
      Database configuration
      • Workload-specific
      • Database architecture
      • SQL Server settings
      • Windows Server settings
      • Performance guidance
      Software
      • SQL Server 2014 Enterprise
      • Windows Server 2012 R2
      [Diagram labels] Processors, Networking, Servers, Storage, Windows Server 2012 R2, SQL Server 2014
  12. Options for data warehouse solutions
      Balancing flexibility and choice
      [Diagram comparing three options by time to solution (HIGH to LOW) and price (HIGH)]
      • By yourself. Offerings: SQL Server 2014, Windows Server 2012 R2, System Center 2012 SP1
      • With a reference architecture. Offerings: Private Cloud Fast Track, Data Warehouse Fast Track
      • With an appliance. Offerings: Data Warehouse Fast Track, Analytics Platform System
      [Diagram labels] Installation; Configuration; Tuning and optimization; Optional, if you have hardware already; Existing or procured hardware and support; Procured software and support; Procured appliance and support
  13. Data Warehouse Fast Track advantages
      • Flexibility and choice
      • Reduced risk
      • Faster deployment
  14. Vendors with 2014 Fast Track Appliances
      • Dell
      • EMC
      • HP/SanDisk
      • Lenovo
      • NEC
      • Tegile
  15. Data Warehouse vs Data Mart
      • Data Warehouse: A single organizational repository of enterprise-wide data across many or all subject areas
        • Holds multiple subject areas
        • Holds very detailed information
        • Works to integrate all data sources
        • Feeds data marts
      • Data Mart: Subset of the data warehouse that is usually oriented to a specific subject (finance, marketing, sales)
        • The logical combination of all the data marts is a data warehouse
      In short, a data warehouse contains many subject areas, and a data mart contains just one of those subject areas
  16. Kimball and Inmon Methodologies
      Two approaches for building data warehouses
  17. Kimball and Inmon Myths
      • Myth: Kimball is a bottom-up approach without enterprise focus
        • Really top-down: BUS matrix (choose business processes/data sources), conformed dimensions, MDM
      • Myth: Inmon requires a ton of up-front design that takes a long time
        • Inmon says to build the DW iteratively, not with a big bang approach (p. 91 BDW, p. 21 Imhoff)
      • Myth: Star schema data marts are not allowed in Inmon’s model
        • Inmon says they are good for direct end-user access of data (p. 365 BDW), good for data marts (p. 12 TTA)
      • Myth: Very few companies use the Inmon method
        • Survey had 39% Inmon vs 26% Kimball. Many have an EDW
      • Myth: The Kimball and Inmon architectures are incompatible
        • They can work together to provide a better solution
  18. Kimball and Inmon Methodologies
      Relational (Inmon) vs Dimensional (Kimball)
      • Relational Modeling:
        • Entity-Relationship (ER) model
        • Normalization rules
        • Many tables using joins
        • History tables, natural keys
        • Good for indirect end-user access of data
      • Dimensional Modeling:
        • Facts and dimensions, star schema
        • Fewer tables, but with duplicate data (de-normalized)
        • Easier for users to understand (but strange for IT people used to relational)
        • Slowly changing dimensions, surrogate keys
        • Good for direct end-user access of data
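
To make the "facts and dimensions, star schema" idea concrete, here is a minimal sketch of a dimensional model. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and every table name, column name and data value is hypothetical, invented for illustration only.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_year INTEGER, month_name TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, quantity INTEGER, amount REAL);

    INSERT INTO dim_date VALUES (20140507, 2014, 'May'), (20140601, 2014, 'June');
    INSERT INTO dim_product VALUES (1, 'Road Bike', 'Bikes'), (2, 'Helmet', 'Accessories');
    INSERT INTO fact_sales VALUES
        (20140507, 1, 2, 1800.0),
        (20140507, 2, 5, 150.0),
        (20140601, 1, 1, 900.0);
""")

# A typical business question needs only one join per dimension.
sales_by_category = conn.execute("""
    SELECT p.category, d.month_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category, d.month_name
    ORDER BY p.category, d.month_name
""").fetchall()

for row in sales_by_category:
    print(row)
```

The descriptive attributes (category, month name) live in the de-normalized dimension tables, so the query shape stays the same no matter which attribute the user slices by.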
  19. Relational Model vs Dimensional Model
      [Diagrams: Relational Model | Dimensional Model]
      If you are a business user, which model is easier to use?
  20. Kimball and Inmon Methodologies
      • Kimball:
        • Logical data warehouse (BUS), made up of subject areas (data marts)
        • Business driven, users have active participation
        • Decentralized data marts (not required to be a separate physical data store)
        • Independent dimensional data marts optimized for reporting/analytics
        • Integrated via Conformed Dimensions (provides consistency across data sources)
        • 2-tier (data mart, cube), less ETL, no data duplication
      • Inmon:
        • Enterprise data model (CIF) that is an enterprise data warehouse (EDW)
        • IT driven, users have passive participation
        • Centralized atomic normalized tables (off limits to end users)
        • Later create dependent data marts that are separate physical subsets of data and can be used for multiple purposes
        • Integration via enterprise data model
        • 3-tier (data warehouse, data mart, cube), duplication of data
  21. Kimball Model
      Why staging: Limit source contention (ELT), Recoverability, Backup, Auditing
  22. Inmon Model
  23. Reasons to add an Enterprise Data Warehouse
      • Single version of the truth
      • May make building dimensions easier using lightly denormalized tables in the EDW instead of going directly from the OLTP source
      • A normalized EDW results in enterprise-wide consistency, which makes it easier to spawn off the data marts, at the expense of duplicated data
      • Less daily ETL refresh and reconciliation if you have many sources and many data marts in multiple databases
      • One place to control data (no duplication of effort and data)
      • Reason not to: If you have only a few sources that need reporting quickly
  24. Which model to use?
      • The models are not that different, having become similar over the years, and can complement each other
      • It boils down to this: Inmon creates a normalized DW before creating a dimensional data mart, and Kimball skips the normalized DW
      • With tweaks to each model, they look very similar (adding a normalized EDW to Kimball, dimensionally structured data marts to Inmon)
      • Bottom line: Understand both approaches and pick parts from both for your situation; there is no need to choose just one approach
      • BUT, no solution will be effective unless you possess soft skills (leadership, communication, planning, and interpersonal relationships)
  25. Hybrid Model
      Advice: Use SQL Server views to interface between each level in the model
      In the DW Bus Architecture, each data mart could be a schema (broken out by business process subject areas), all in one database. Another option is to have each data mart in its own database, with all databases on one server or spread among multiple servers. Also, the staging areas, CIF, and DW Bus can all be on the same powerful server (MPP)
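
The advice about using views between levels can be sketched as follows. This is only an illustration under assumptions: sqlite3 stands in for SQL Server, and the table `edw_customer`, the view `v_dim_customer` and their columns are hypothetical names, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalized EDW table with terse, source-style column names.
    CREATE TABLE edw_customer (
        customer_id INTEGER PRIMARY KEY,
        first_nm TEXT,
        last_nm TEXT,
        region_cd TEXT
    );
    INSERT INTO edw_customer VALUES (1, 'Ada', 'Lovelace', 'W'), (2, 'Alan', 'Turing', 'E');

    -- The view is the interface the next level (data mart, cube) reads from.
    -- It renames and reshapes columns without copying any data.
    CREATE VIEW v_dim_customer AS
    SELECT customer_id AS customer_key,
           first_nm || ' ' || last_nm AS customer_name,
           CASE region_cd WHEN 'W' THEN 'West' WHEN 'E' THEN 'East' ELSE 'Unknown' END AS region
    FROM edw_customer;
""")

rows = conn.execute(
    "SELECT customer_key, customer_name, region FROM v_dim_customer ORDER BY customer_key"
).fetchall()
print(rows)
```

Because downstream consumers depend only on the view's column list, the underlying table can be restructured later as long as the view definition is updated to match.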
  26. Kimball Methodology
      From: Kimball’s The Microsoft Data Warehouse Toolkit
      Kimball defines a development lifecycle, whereas Inmon is just about the data warehouse (not “how” it is used)
  27. Populating a Data Warehouse
      • Determine frequency of data pull (daily, weekly, etc.)
      • Full Extraction: All data (usually dimension tables)
      • Incremental Extraction: Only data changed since the last run (fact tables)
      • How to determine data that has changed:
        • Timestamp (Last Updated)
        • Change Data Capture (CDC)
        • Partitioning by date
        • Triggers on tables
        • MERGE SQL statement
        • Column DEFAULT value populated with date
      • Online Extraction: Data from source. First create a copy of the source:
        • Replication
        • Database Snapshot
        • Availability Groups
      • Offline Extraction: Data from flat file
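
As an illustration of incremental extraction using the "Last Updated" timestamp method, here is a minimal sketch. It assumes a hypothetical `orders` source table with a `last_updated` column stored as ISO-8601 text (which sorts correctly as a string), and uses sqlite3 in place of a real source system.

```python
import sqlite3

def extract_incremental(conn, last_run):
    """Return only the source rows changed since the previous run."""
    cur = conn.execute(
        "SELECT order_id, amount, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_run,),
    )
    return cur.fetchall()

def new_watermark(rows, last_run):
    """Advance the watermark to the newest timestamp seen, or keep the old one."""
    return max((r[2] for r in rows), default=last_run)

# Hypothetical source table and data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, last_updated TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2014-05-01T08:00:00"),
    (2, 25.5, "2014-05-02T09:30:00"),
    (3, 40.0, "2014-05-03T11:15:00"),
])

last_run = "2014-05-01T23:59:59"          # watermark saved by the previous load
changed = extract_incremental(conn, last_run)
last_run = new_watermark(changed, last_run)
print(changed)
print(last_run)
```

The watermark would normally be persisted in a control table between runs; this approach also relies on the source system reliably updating the timestamp on every change.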
  28. ETL vs ELT
      • Extract, Transform, and Load (ETL)
        • Transform while hitting source system
        • No staging tables
        • Processing done by ETL tools (SSIS)
      • Extract, Load, Transform (ELT)
        • Uses staging tables
        • Processing done by target database engine (SSIS: Execute T-SQL Statement task instead of Data Flow Transform tasks)
        • Use for big volumes of data
        • Use when source and target databases are the same
        • Use with the Analytics Platform System (APS)
      ELT is better since the database engine is more efficient than SSIS
      • Best use of database engine: Transformations
      • Best use of SSIS: Data pipeline and workflow management
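
The ELT pattern described above can be sketched like this: raw rows are loaded untouched into a staging table, and the transformation then runs as set-based SQL inside the target database engine. sqlite3 stands in for the target engine here, and the tables `stg_sales` and `fact_sales`, the sample rows and the cleansing rules are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_sales (sale_date TEXT, region TEXT, amount TEXT);   -- raw, untyped
    CREATE TABLE fact_sales (sale_date TEXT, region TEXT, amount REAL);  -- cleaned
""")

# Extract + Load: copy source rows into staging exactly as they arrive.
raw_rows = [
    ("2014-05-07", " west ", "100.50"),
    ("2014-05-07", "EAST", "20"),
    ("2014-05-08", "West", "not-a-number"),
]
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", raw_rows)

# Transform: one set-based statement executed by the database engine.
# The GLOB filter is a deliberately crude validity check (value starts with a digit).
conn.execute("""
    INSERT INTO fact_sales (sale_date, region, amount)
    SELECT sale_date, UPPER(TRIM(region)), CAST(amount AS REAL)
    FROM stg_sales
    WHERE amount GLOB '[0-9]*'
""")

loaded = conn.execute("SELECT * FROM fact_sales ORDER BY region").fetchall()
print(loaded)
```

In this arrangement the orchestration tool only moves data and triggers the SQL step, which matches the slide's split between workflow management and transformations.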
  29. Surrogate Keys
      Surrogate key: Unique identifier not derived from the source system
      • Embedded in fact tables as foreign keys to dimension tables
      • Allows integrating data from multiple source systems
      • Protects from source key changes in the source system
      • Allows for slowly changing dimensions
      • Allows you to create rows in the dimension that don’t exist in the source (-1 in fact table for unassigned)
      • Improves performance (joins) and database size by using integer type instead of text
      • Implemented via identity column on dimension tables
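
A minimal sketch of these points, under assumptions: sqlite3's auto-assigned INTEGER PRIMARY KEY stands in for a SQL Server identity column, and the tables, the two source systems ("CRM", "ERP") and the helper functions are hypothetical names invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key, assigned by the database
        source_system TEXT,
        source_id TEXT,                     -- natural key from the source system
        name TEXT
    );
    -- Dimension row that does not exist in any source: the "unassigned" member.
    INSERT INTO dim_customer VALUES (-1, 'n/a', 'n/a', 'Unassigned');

    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        amount REAL
    );
""")

def add_customer(conn, source_system, source_id, name):
    """Insert a dimension row and return its generated surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_customer (source_system, source_id, name) VALUES (?, ?, ?)",
        (source_system, source_id, name),
    )
    return cur.lastrowid

def get_customer_key(conn, source_system, source_id):
    """Look up the surrogate key; fall back to -1 when the customer is unknown."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE source_system = ? AND source_id = ?",
        (source_system, source_id),
    ).fetchone()
    return row[0] if row else -1

# The same natural key "1001" arrives from two different source systems.
crm_key = add_customer(conn, "CRM", "1001", "Acme")
erp_key = add_customer(conn, "ERP", "1001", "Acme Corp")

for system, source_id, amount in [("CRM", "1001", 50.0), ("ERP", "1001", 75.0), ("ERP", "9999", 5.0)]:
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?)",
        (get_customer_key(conn, system, source_id), amount),
    )

totals = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(totals)
```

The fact table stores only compact integer keys, and the sale with an unknown customer still joins cleanly because it points at the -1 row.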
  30. SSAS Cubes
      Reasons to report off cubes instead of the data warehouse:
      • Aggregating (summarizing) the data for performance
      • Multidimensional analysis: slice, dice, drilldown, show details
      • Can store hierarchies
      • Built-in support for KPIs
      • Security: You can use the security settings to give end users access to only those parts (slices) of the cube relevant to them
      • Built-in advanced time calculations, e.g. 12-month rolling average
      • Easily use Excel to view data via PivotTables
      • Automatically handles Slowly Changing Dimensions (SCD)
      • Required for PerformancePoint, Power View, and SSAS data mining
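
For readers unfamiliar with the "12-month rolling average" mentioned above, here is a small sketch of the calculation itself in plain Python. It does not use or imitate SSAS; the function name and the sample figures are hypothetical, and it only shows what such a time calculation computes.

```python
from collections import deque

def rolling_average(values, window=12):
    """Average of the most recent `window` values at each position.

    Early positions, where fewer than `window` values exist, average what is available.
    """
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]      # the oldest value is about to fall out of the window
        recent.append(value)
        total += value
        averages.append(total / len(recent))
    return averages

# Hypothetical monthly sales figures.
monthly_sales = [100, 120, 90, 110, 130, 95, 105, 115, 125, 100, 110, 120, 140]
print(rolling_average(monthly_sales, window=12))
```

A cube exposes this kind of calculation as a reusable measure, so report authors do not have to re-implement it in every query.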
  31. Data Warehouse Architecture
  32. Modern Data Warehouse
      Think about future needs:
      • Increasing data volumes
      • Real-time performance
      • New data sources and types (Hadoop)
      • Cloud-born data
      Solution: Microsoft Analytics Platform System
      • Scalable
      • MPP architecture
      • HDInsight
      • PolyBase
      Follow-on presentation: “Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud, and MPP)”
  33. Resources
      • Data Warehouse Architecture – Kimball and Inmon methodologies: http://bit.ly/SrzNHy
      • SQL Server 2012: Multidimensional vs tabular: http://bit.ly/SrzX1x
      • Data Warehouse vs Data Mart: http://bit.ly/SrAi4p
      • Fast Track Data Warehouse Reference Architecture for SQL Server 2014: http://bit.ly/1xuX9m6
      • Complex reporting off a SSAS cube: http://bit.ly/SrAEYw
      • Surrogate Keys: http://bit.ly/SrAIrp
      • Normalizing Your Database: http://bit.ly/SrAHnc
      • Difference between ETL and ELT: http://bit.ly/SrAKQa
      • Microsoft’s Data Warehouse offerings: http://bit.ly/xAZy9h
      • Microsoft SQL Server Reference Architecture and Appliances: http://bit.ly/y7bXY5
      • Methods for populating a data warehouse: http://bit.ly/SrARuZ
      • Great white paper: Microsoft EDW Architecture, Guidance and Deployment Best Practices: http://bit.ly/SrAZug
      • End-User Microsoft BI Tools – Clearing up the confusion: http://bit.ly/SrBMLT
      • Microsoft Appliances: http://bit.ly/YQIXzM
      • Why You Need a Data Warehouse: http://bit.ly/1fwEq0j
      • Data Warehouse Maturity Model: http://bit.ly/xl4mGM
      • Operational Data Store (ODS) Defined: http://bit.ly/1H6wnE7
      • The Modern Data Warehouse: http://bit.ly/1xuX4Py
  34. Q & A
      James Serra, Big Data Evangelist
      Email me at: JamesSerra3@gmail.com
      Follow me at: @JamesSerra
      Link to me at: www.linkedin.com/in/JamesSerra
      Visit my blog at: JamesSerra.com (where this slide deck will be posted under the “Presentations” tab)
