25th SQL Night
Building a Data Warehouse in SQL Server
Antonios Chatzipavlis
SQLschool.gr Founder, Principal Consultant
SQL Server Evangelist, MVP on SQL Server
May 30, 2015
1982 I started with computers.
1988 I started my professional career in the computer industry.
1996 I started working with SQL Server, version 6.0.
1998 I earned my first certification at Microsoft as a Microsoft Certified Solution
Developer (3rd in Greece) and started my career as a Microsoft Certified
Trainer (MCT), with more than 20,000 hours of training to date!
2010 I became a Microsoft MVP on SQL Server for the first time.
2012 I created SQL School Greece (www.sqlschool.gr).
2013 I became an MCT Regional Lead in the Microsoft Learning program and
was certified as MCSE: Data Platform and MCSE: Business Intelligence.
Antonios Chatzipavlis
Database Architect
SQL Server Evangelist
MCT, MCSE, MCITP, MCPD, MCSD, MCDBA,
MCSA, MCTS, MCAD, MCP, OCA, ITIL-F
Follow us on social media
Twitter @antoniosch / @sqlschool
Facebook fb/sqlschoolgr
YouTube yt/user/achatzipavlis
LinkedIn SQL School Greece group
Pinterest pi/SQLschool/
help@sqlschool.gr
Stay Involved!
• Sign up for a free membership today at sqlpass.org
• Linked In: http://www.sqlpass.org/linkedin
• Facebook: http://www.sqlpass.org/facebook
• Twitter: @SQLPASS
• PASS: http://www.sqlpass.org
Whatever your data passion – there’s a Virtual Chapter for you!
www.sqlpass.org/vc
Planning on attending PASS Summit 2015? Start
saving today!
• The world’s largest gathering of SQL Server & BI professionals
• Take your SQL Server skills to the next level by learning from the world’s
top SQL Server experts, in over 190 technical sessions
• Over 5000 registrations, representing 2000 companies, from 52
countries, ready to network & learn
Save $150 right now using
discount code LC15CPJ8
$1795
until July 12th, 2015
Don’t miss your chance to vote in the 2015 PASS
elections: update your myPASS profile by June 1!
In order to vote for the 2015 PASS Nomination Committee & the Board of Directors, you need
to complete all mandatory fields in your myPASS profile by 11:59 PM PDT June 1, 2015.
• PASS members will be reminded to review & complete their profiles
• Members will receive instructions for updating profiles and deleting duplicate profiles
• Eligible voters will receive information about key election dates and the voting process after June 1
Head to sqlpass.org/myPASS today!
For more info on elections,
visit sqlpass.org/elections
• Overview of Data Warehousing
• Data Warehouse Solution
• Data Warehouse Infrastructure
• Data Warehouse Hardware
• Data Warehouse Design Overview
• Designing Dimension Tables
• Designing Fact Tables
• Data Warehouse Physical Design
Agenda
Overview of Data Warehousing
• There are many definitions for the term “data warehouse,”
and disagreements over specific implementation details.
• It is generally agreed that a data warehouse is a centralized
store of business data that can be used for reporting and
analysis to inform key decisions.
• A data warehouse provides a solution to the problem of
distributed data that prevents effective business decision-
making.
What is a Data Warehouse?
The single organizational repository of enterprise-wide
data across many or all lines of business and subject
areas.
Contains massive and integrated data.
Represents the complete organizational view of
information needed to run and understand the
business.
Definition of Data Warehouse
• Contains a large volume of data that relates to historical
business transactions.
• Is optimized for read operations that support querying the
data.
• Is loaded with new or updated data at regular intervals.
• Provides the basis for enterprise BI applications.
Data Warehouse characteristics
• Finding the information required for business decisions
• This is time-consuming and error-prone.
• Key business data is distributed across multiple systems.
• This makes it hard to collate all the information necessary for a particular
business decision.
• Fundamental business questions are hard to answer.
• Most business decisions require knowledge of fundamental facts.
• The distribution of data throughout multiple systems in a typical
organization can make these questions difficult, or even impossible, to answer.
What makes a Data Warehouse useful?
• The specific, subject-oriented, or departmental view of
information from the organization.
• Generally, these are built to satisfy user requirements for
information.
What is a Data Mart?
Data Warehouse vs Data Mart
Scope
• Data Warehouse: application independent; centralized or enterprise; planned
• Data Mart: specific application; decentralized by group; organic but may be planned
Data
• Data Warehouse: historical, detailed, summary; some denormalization
• Data Mart: some history, detailed, summary; high denormalization
Subjects
• Data Warehouse: multiple subjects
• Data Mart: single central subject area
Sources
• Data Warehouse: many internal and external sources
• Data Mart: few internal and external sources
Other
• Data Warehouse: flexible; data oriented; long life; single complex structure
• Data Mart: restrictive; project oriented; short life; multiple simple structures
• Centralized Data Warehouse
• Departmental Data Mart
• Hub and Spoke
Data Warehouse Architectures
Components of a Data Warehousing Solution
Data sources feed an ETL process that loads the data warehouse, supported by
data cleansing and master data management; data models built on the warehouse
support reporting and analysis.
1. Start by identifying the business questions that the data warehousing
solution must answer
2. Determine the data that is required to answer these questions
3. Identify data sources for the required data
4. Assess the value of each question to key business objectives versus
the feasibility of answering it from the available data
For large enterprise-level projects, an incremental approach can be effective:
• Break the project down into multiple sub-projects
• Each sub-project deals with a particular subject area in the data warehouse
Starting a Data Warehouse Project
Core Data Warehousing
• SQL Server Database Engine
• SQL Server Integration Services
• SQL Server Master Data Services
• SQL Server Data Quality Services
Enterprise BI
• SQL Server Analysis Services
• SQL Server Reporting Services
• Microsoft SharePoint Server
• Microsoft Office
Self-Service BI
• Excel add-ins (PowerPivot, Power Query, Power View, Power Map)
• Microsoft Office 365 Power BI
Big Data Analysis
• Windows Azure HDInsight
SQL Server As a Data Warehousing Platform
Data Warehouse Solution
A data warehouse
is a relational database that
is optimized for reading data
for analysis and reporting.
Keep in mind
• Logical:
• Is typically designed to denormalize data into a structure that minimizes the
number of join operations required in the queries used to retrieve and
aggregate data.
• A common approach is to design a star schema
• Physical:
• Affects the performance and manageability of the data warehouse
Logical and Physical Database schema
• Query processing requirements, including anticipated peak
memory and CPU utilization.
• Storage volume and disk input/output requirements.
• Network connectivity and bandwidth.
• Component redundancy for high availability.
Hardware selection
• Failover time requirements.
• Configuration and management complexity.
• The volume of data in the data warehouse.
• The frequency of changes to data in the data warehouse.
• The effect of the backup process on data warehouse
performance.
• The time to recover the database in the event of a failure.
High availability and Disaster Recovery
• The authentication mechanisms that you must support to
provide access to the data warehouse.
• The permissions that the various users who access the data
warehouse will require.
• The connections over which data is accessed.
• The physical security of the database and backup media.
Security
• Data Source Connection Types
• Credentials and Permissions
• Data Formats
• Data Acquisition Windows
Data sources
• Staging:
• What data must be staged?
• Staging data format
• Required transformations:
• Transformations during extraction versus data flow transformations
• Incremental ETL:
• Identifying data changes for extraction
• Inserting or updating when loading
ETL Processes
• Data quality:
• Cleansing data:
• Validating data values
• Ensuring data consistency
• Identifying missing values
• Deduplicating data
• Master data management:
• Ensuring consistent business entity definitions across multiple systems
• Applying business rules to ensure data validity
Data Quality and Master Data Management
Data Warehouse Infrastructure
Data volume
• The amount of data that the data warehouse must store
• The size and frequency of incremental loads of new data.
• The primary consideration is the number of rows in fact tables
• But don’t forget dimension data, indexes, and data models
that are stored on disk.
System Sizing Factors
Analysis and Reporting Complexity
• This includes the number, complexity, and predictability of the
queries that will be used to analyze the data or produce reports.
• Typically, BI solutions must support a mix of the following query
types:
• Simple. Relatively straightforward SELECT statements.
• Medium. Repeatedly executed queries that include aggregations or many joins.
• Complex. Unpredictable queries with complex aggregations, joins, and
calculations.
System Sizing Factors
Number of Users
• This is the total number of information workers who will
access the system, and how many of them will do so
concurrently.
Availability Requirements
• These include when the system will need to be used, and
what planned or unplanned downtime the business can
tolerate.
System Sizing Factors
Typical System Categorization
Small
• Data volume: 100s of GBs to 1 TB
• Analysis and reporting complexity: over 50% simple, 30% medium, less than 10% complex
• Number of users: 100 total, 10 to 20 concurrent
• Availability requirements: business hours
Medium
• Data volume: 1 to 10 TB
• Analysis and reporting complexity: 50% simple, 30-35% medium, 10-15% complex
• Number of users: 1,000 total, 100 to 200 concurrent
• Availability requirements: 1 hour of downtime per night
Large
• Data volume: 10 TB to 100s of TBs
• Analysis and reporting complexity: 30-35% simple, 40% medium, 20-25% complex
• Number of users: 1,000s of concurrent users
• Availability requirements: 24/7 operations
Data Warehouse Workloads
ETL
• Control flow tasks
• Data query and insert
• Network data transfer
• In-memory data pipeline
• SSIS Catalog or MSDB I/O
Reporting
• Client requests
• Data source queries
• Report rendering
• Caching
• Snapshot execution
• Subscription processing
• Report Server Catalog I/O
Operations and Maintenance
• OS activity
• Logging
• SQL Server Agent jobs
• SSIS packages
• Indexes
• Backups
Cubes
• Processing
• Aggregation storage (multidimensional on disk, tabular in memory)
• Query execution
All of these workloads contend for resources on the data warehouse (DW) server.
Typical Server Topologies for a BI Solution
A single-server architecture uses few servers; a distributed architecture uses
many. Moving from few to many servers increases hardware costs, software
license costs, and configuration complexity, but also improves scalability,
performance, and flexibility.
Scaling-out a BI Solution
Data Warehouse
• Partition the data across multiple database servers
• SQL Server Parallel Data Warehouse edition
Integration Services
• Use multiple SSIS servers to perform a subset of the ETL processes in parallel
Analysis Services
• Create a read-only copy of a multidimensional database and connect to it from
multiple Analysis Services query servers
Reporting Services
• Install the Reporting Services database on a single database server, then
install the Reporting Services report server service on multiple servers that
all connect to the same Reporting Services database
Planning for High Availability
Options across the solution components (data warehouse, Analysis Services,
Integration Services, Reporting Services) include AlwaysOn Failover Clusters,
AlwaysOn Availability Groups, RAID storage, and NLB report servers.
Data Warehouse Hardware
• A DW usually has longer-running queries
• A DW has higher read activity than write activity
• The data in a DW is usually more static
• In a DW it is much more important to be able to process a large amount of
data quickly than it is to support a high number of I/O operations per second
Keep in mind
• Determine initial data volume
• Number of fact table rows x row size
• Use 100 bytes per row as an estimate if unknown
• Add 30-40% for dimensions and indexes
• Project data growth
• Number of new fact rows per month
• Factor in compression
• Typically 3:1
Determining Storage Requirements
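As a rough worked example (all figures assumed): 500 million fact rows x 100
bytes per row ≈ 50 GB; adding 35% for dimensions and indexes gives ≈ 67 GB; at
a 3:1 compression ratio this stores in roughly 22 GB, before projecting
monthly growth.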
Other storage requirements
• Configuration databases
• Log files
• TempDB
• Staging tables
• Backups
• Analysis Services models
• Use more, smaller disks instead of fewer, larger disks
• Use the fastest disks you can afford
• Consider solid state disks, especially for random I/O
• Use RAID 10, or minimally RAID 5
• Consider a dedicated storage area network for manageability
and extensibility
• Balance I/O across enclosures, storage processors, and disk groups
Considerations for Storage Hardware
Server size Minimum memory Maximum memory
1 socket 64 GB 128 GB
2 sockets 128 GB 256 GB
4 sockets 256 GB 512 GB
8 sockets 512 GB 768 GB
Server Memory
• Determine core MCR
• Apply formula to estimate required number of cores:
Estimating CPU Requirements
CPUs = (Average query size in MB / MCR) x (Concurrent users / Target response time)
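For example (all figures assumed): with an MCR of 200 MB/s per core, an
average query that scans 18,000 MB of data, 10 concurrent users, and a target
response time of 60 seconds, the formula gives (18,000 / 200) x (10 / 60) = 15 cores.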
• This metric measures the maximum SQL Server data processing rate for a
standard query and data set for a specific server and CPU combination.
• This is provided as a per-core rate, and it is measured as a query-based scan
from memory cache.
• MCR is the initial starting point for Fast Track system design.
• It represents an estimated maximum required I/O bandwidth for the server, CPU,
and workload.
• MCR is useful as an initial design guide because it requires only minimal local
storage and database schema to estimate potential throughput for a given CPU.
• It is not a measure of system performance.
Maximum Consumption Rate (MCR)
• Create a reference dataset based on the TPC-H line item table or similar data set.
• The table should be of a size that it can be entirely cached in the SQL Server buffer pool yet still
maintain a minimum one-second execution time for the query provided here.
• For FTDW the following query is used:
SELECT sum([integer field]) FROM [table]
WHERE [restrict to appropriate data volume]
GROUP BY [col].
• Ensure that Resource Governor settings are at default values.
• Ensure that the query is executing from the buffer cache.
• Executing the query once should put the pages into the buffer, and subsequent executions should
read fully from buffer. Validate that there are no physical reads in the query statistics output.
Calculate MCR
• Set STATISTICS IO and STATISTICS TIME to ON to output results.
• Run the query multiple times, at MAXDOP = 1.
• Record the number of logical reads and CPU time from the statistics
output for each query execution.
• Calculate the MCR in MB/s using the formula:
( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024
• A consistent range of values (+/- 5%) should appear over a
minimum of five query executions.
• Significant outliers (+/- 20% or more) may indicate configuration issues. The
average of at least 5 calculated results is the FTDW MCR.
Calculate MCR
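A minimal T-SQL sketch of this measurement, assuming a hypothetical cached
fact table dbo.lineitem with integer columns quantity, orderdatekey, and
productkey:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run at least five times; the first execution warms the buffer cache,
-- and subsequent runs should show zero physical reads.
SELECT SUM(quantity)
FROM dbo.lineitem
WHERE orderdatekey BETWEEN 20130101 AND 20131231 -- restrict to an appropriate data volume
GROUP BY productkey
OPTION (MAXDOP 1);

-- From each run's statistics output:
-- MCR (MB/s) = ([logical reads] / [CPU time in seconds]) * 8 / 1024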
• Special SQL Server Edition only available in hardware appliances
• Massively parallel processing
• Shared-nothing architecture
• Dedicated control nodes, compute nodes, and storage nodes
SQL Server Parallel Data Warehouse
A PDW appliance combines a control node, cluster management servers, a landing
zone (ETL interface), and backup nodes with database servers (compute nodes)
connected over Infiniband and dual Fibre Channel to dedicated storage arrays.
Data Warehouse Design Overview
The Dimensional Model
A fact table containing measures is surrounded by dimension tables containing
attributes. When each dimension is a single table joined directly to the fact
table, the design is a star schema; when dimensions are normalized into
multiple related tables, it is a snowflake schema.
• Identify the grain
• Select the required dimensions
• Identify the facts
Dimensional Modeling
• The grain of a dimensional model is the lowest level of detail
at which you can aggregate the measures.
• It is important to choose the level of grain that will support the
most granular of reporting and analytical requirements
• Typically the lowest level possible from the source data is the
best option.
Identify the grain
• Determine which of the dimensions related to the business
process should be included in the model
• The selection of dimensions depends on the reporting and
analytical requirements, specifically on the business entities by
which users need to aggregate the measures
• Almost all dimensional models include a time-based
dimension
Select the required dimensions
• Identify the facts that you want to include as measures.
• Measures are numeric values that can be expressed at the
level of the grain chosen earlier and aggregated across the
selected dimensions.
• Depending on the grain you choose for the dimensional
model and the grain of the source data, you might need to
allocate measures from a higher level of grain across multiple
fact rows.
Identify the facts
Documenting Dimensional Models
Example: a Sales Order fact table with measures Item Quantity, Unit Cost,
Total Cost, Unit Price, Sales Amount, and Shipping Cost, related to four
dimensions:
• Time (Order Date and Ship Date): Calendar Year, Month, Date; Fiscal Year,
Fiscal Quarter, Month, Date
• Salesperson: Region, Country, Territory, Manager, Name
• Customer: Name, Country, State or Province, City, Age, Marital Status, Gender
• Product: Category, Subcategory, Product Name, Color, Size
Designing Dimension Tables
Each row
in a dimension table represents
an instance of a business entity
by which the measures
in the fact table
can be aggregated
Keep in mind
• A key column uniquely identifies each row in the dimension table.
• Usually the dimension data is obtained from a source system in
which a key is already assigned; this is the “business key”.
• It is standard practice to define a new “surrogate key” that uses an
integer value to identify each row.
• A surrogate key is recommended for the following reasons:
• The data warehouse might use dimension data from multiple source systems, so it is possible
that business keys are not unique.
• Some source systems use non-numeric keys, such as a globally unique identifier (GUID), or
natural keys, such as an email address, to uniquely identify data entities. Integer keys are smaller
and more efficient to use in joins from fact tables.
• If the dimension table supports “Type 2” slowly-changing dimensions, multiple rows can share
the same business key, so the business key alone cannot identify a row.
Dimension keys
ProductKey ProductAltKey ProductName Color Size
1 MB1-B-32 MB1 Mountain Bike Blue 32
2 MB1-R-32 MB1 Mountain Bike Red 32
CustomerKey CustomerAltKey Name
1 1002 Amy Alberts
2 1005 Neil Black
ProductKey and CustomerKey are surrogate keys; ProductAltKey and
CustomerAltKey are business (alternate) keys.
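A minimal sketch of a dimension table following this pattern (table and column
names are illustrative, matching the example above):

CREATE TABLE dbo.DimProduct
(
    ProductKey    int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    ProductAltKey nvarchar(20) NOT NULL,                  -- business (alternate) key
    ProductName   nvarchar(100) NOT NULL,
    Color         nvarchar(20) NULL,
    Size          int NULL
);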
• Hierarchies
• Multiple attributes can be combined to form hierarchies that enable users to drill
down into deeper levels of detail.
• Business users can view aggregated fact data at each level
• Slicers
• Attributes do not need to form hierarchies to be useful in analysis and reporting.
• Business users can group or filter data based on single-level hierarchies to create
analytical sub-groupings of data.
• Drill-through detail
• Some attributes have little value as slicers or members of a hierarchy.
• It can be useful to include entity-specific attributes to facilitate drill-through
functionality in reports or analytical applications.
Dimension Attributes and Hierarchies
Dimension Attributes and Hierarchies
CustKey CustAltKey Name Country State City Phone Gender
1 1002 Amy Alberts Canada BC Vancouver 555 123 F
2 1005 Neil Black USA CA Irvine 555 321 M
3 1006 Ye Xu USA NY New York 555 222 M
Country, State, and City form a hierarchy; Gender is a slicer; Phone is
drill-through detail.
• Identify the semantic meaning of NULL
• Unknown or None?
• Do not assume NULL equality
• Use ISNULL( )
Unknown and None
Source table
OrderNo Discount DiscountType
1000 1.20 Bulk Discount
1001 0.00 N/A
1002 2.00 (NULL)
1003 0.50 Promotion
1004 2.50 Other
1005 0.00 N/A
1006 1.50 (NULL)
Dimension table
DiscKey DiscAltKey DiscountType
-1 Unknown Unknown
0 N/A None
1 Bulk Discount Bulk Discount
2 Promotion Promotion
3 Other Other
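A hedged sketch of the lookup this implies during loading, using ISNULL so
that NULL discount types map to the Unknown row (-1) rather than failing a
NULL-equality join (the staging and dimension table names are assumed):

SELECT s.OrderNo,
       s.Discount,
       ISNULL(d.DiscKey, -1) AS DiscKey  -- any unmatched value also falls back to Unknown
FROM staging.Orders AS s
LEFT JOIN dbo.DimDiscount AS d
       ON d.DiscAltKey = ISNULL(s.DiscountType, 'Unknown'); -- NULL maps to the Unknown member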
• The simplest type of SCD to implement.
• Attribute values are updated directly in the existing dimension table
row and no history is maintained.
• Suitable for attributes that are used to provide drill-through details
• Unsuitable for analytical slicers or hierarchy members where historic
comparisons must reflect the attribute values as they were at the
time of the fact event.
Slowly Changing Dimensions – Type 1
Before:
CustKey CustAltKey Name Phone
1 1002 Amy Alberts 555 123
After (updated in place):
CustKey CustAltKey Name Phone
1 1002 Amy Alberts 555 222
• These changes involve the creation of a fresh version of the dimension entity in the form of a new
row.
• Typically, a bit column in the dimension table is used as a flag to indicate which version of the
dimension row is the current one.
• Additionally, datetime columns are often used to indicate the start and end of the period for which a
version of the row was (or is) current.
• Maintaining start and end dates makes it easier to assign the appropriate foreign key value to fact rows as they are
loaded so they are related to the version of the dimension entity that was current at the time the fact occurred.
Slowly Changing Dimensions – Type 2
Before:
CustKey CustAltKey Name City Current Start End
1 1002 Amy Alberts Vancouver Yes 1/1/2000
After (new row added):
CustKey CustAltKey Name City Current Start End
1 1002 Amy Alberts Vancouver No 1/1/2000 1/1/2012
4 1002 Amy Alberts Toronto Yes 1/1/2012
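A minimal Type 2 sketch under the column names assumed above: expire the
current row, then insert the new version (in practice this is often done with
a MERGE statement or the SSIS Slowly Changing Dimension transformation):

-- Expire the existing current version of the changed customer.
UPDATE dbo.DimCustomer
SET [Current] = 0, [End] = @LoadDate
WHERE CustAltKey = @CustAltKey AND [Current] = 1;

-- Insert the new version as the current row (CustKey is assumed to be an IDENTITY).
INSERT INTO dbo.DimCustomer (CustAltKey, Name, City, [Current], [Start])
VALUES (@CustAltKey, @Name, @City, 1, @LoadDate);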
• Rarely used
• The previous value (or a complete history of previous values) is
maintained in the dimension table row.
• This requires modifying the dimension table schema to
accommodate new values for each tracked attribute, and can result
in a complex dimension table that is difficult to manage.
Slowly Changing Dimensions – Type 3
Before:
CustKey CustAltKey Name Cars
1 1002 Amy Alberts 0
After (prior value kept in a new column):
CustKey CustAltKey Name Prior Cars Current Cars
1 1002 Amy Alberts 0 1
• Surrogate key
• Granularity
• Range
Time Dimension
DateKey DateAltKey MonthDay WeekDay Day MonthNo Month Year
00000000 01-01-1753 NULL NULL NULL NULL NULL NULL
20130101 01-01-2013 1 3 Tue 01 Jan 2013
20130102 01-02-2013 2 4 Wed 01 Jan 2013
20130103 01-03-2013 3 5 Thu 01 Jan 2013
20130104 01-04-2013 4 6 Fri 01 Jan 2013
• Attributes and hierarchies
• Multiple calendars
• Unknown values
• Create a Transact-SQL script
• Use Microsoft Excel
• Use a BI tool to autogenerate a time dimension table
Populating a Time Dimension Table
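A minimal Transact-SQL sketch that populates one year of the date dimension
shown earlier (the table name DimDate is assumed; the WeekDay number depends
on the session's DATEFIRST setting):

DECLARE @d date = '2013-01-01';
WHILE @d <= '2013-12-31'
BEGIN
    INSERT INTO dbo.DimDate (DateKey, DateAltKey, MonthDay, [WeekDay], [Day], MonthNo, [Month], [Year])
    VALUES (CONVERT(int, CONVERT(char(8), @d, 112)), -- e.g. 20130101
            @d,
            DAY(@d),                        -- day of the month
            DATEPART(weekday, @d),          -- weekday number
            LEFT(DATENAME(weekday, @d), 3), -- e.g. Tue
            MONTH(@d),                      -- month number
            LEFT(DATENAME(month, @d), 3),   -- e.g. Jan
            YEAR(@d));
    SET @d = DATEADD(day, 1, @d);
END;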
• A common requirement in a data warehouse is to support
dimensions with parent-child hierarchies
• Typically, parent-child hierarchies are implemented as self-
referencing tables, in which a column in each row is used as a
foreign-key reference to a primary-key value in the same
table
Self-Referencing Dimension
EmployeeKey EmployeeAltKey EmployeeName ManagerKey
1 1000 Kim Abercrombie NULL
2 1001 Kamil Amireh 1
3 1002 Cesar Garcia 1
4 1003 Jeff Hay 2
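A hedged sketch of querying such a table with a recursive common table
expression, returning each employee with their depth in the hierarchy (the
table name DimEmployee is assumed):

WITH OrgChart AS
(
    SELECT EmployeeKey, EmployeeName, ManagerKey, 0 AS OrgLevel
    FROM dbo.DimEmployee
    WHERE ManagerKey IS NULL  -- anchor: the top of the hierarchy

    UNION ALL

    SELECT e.EmployeeKey, e.EmployeeName, e.ManagerKey, c.OrgLevel + 1
    FROM dbo.DimEmployee AS e
    JOIN OrgChart AS c ON e.ManagerKey = c.EmployeeKey  -- recurse one level down
)
SELECT * FROM OrgChart;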
• Combine low-cardinality attributes that don’t belong in
existing dimensions into a junk dimension
• Avoids creating many small dimension tables
Junk Dimensions
JunkKey OutOfStockFlag FreeShippingFlag CreditOrDebit
1 1 1 Credit
2 1 1 Debit
3 1 0 Credit
4 1 0 Debit
5 0 1 Credit
6 0 1 Debit
7 0 0 Credit
8 0 0 Debit
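One common way to populate such a dimension is to cross join the possible
values of each flag, which yields the eight rows shown above (the table name
DimJunk is an assumption, with JunkKey as an IDENTITY column):

INSERT INTO dbo.DimJunk (OutOfStockFlag, FreeShippingFlag, CreditOrDebit)
SELECT f1.v, f2.v, f3.v
FROM (VALUES (1), (0)) AS f1(v)
CROSS JOIN (VALUES (1), (0)) AS f2(v)
CROSS JOIN (VALUES ('Credit'), ('Debit')) AS f3(v);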
Designing Fact Tables
Fact Table Columns
OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount
20120101 25 120 1000 1 350.99
20120101 99 120 1000 2 6.98
20120101 25 178 1001 2 701.98
OrderDateKey, ProductKey, and CustomerKey are dimension keys; Qty and
SalesAmount are measures; OrderNo is a degenerate dimension.
Types of Measure
Additive measures
OrderDateKey ProductKey CustomerKey SalesAmount
20120101 25 120 350.99
20120101 99 120 6.98
20120102 25 178 701.98
Semi-additive measures
DateKey ProductKey StockCount
20120101 25 23
20120101 99 118
20120102 25 22
Non-additive measures
OrderDateKey ProductKey CustomerKey ProfitMargin
20120101 25 120 25
20120101 99 120 22
20120102 25 178 27
Types of Fact Table
Transaction fact tables
OrderDateKey ProductKey CustomerKey OrderNo Qty Cost SalesAmount
20120101 25 120 1000 1 125.00 350.99
20120101 99 120 1000 2 2.50 6.98
20120101 25 178 1001 2 250.00 701.98
Periodic snapshot fact tables
DateKey ProductKey OpeningStock UnitsIn UnitsOut ClosingStock
20120101 25 25 1 3 23
20120101 99 120 0 2 118
Accumulating snapshot fact tables
OrderNo OrderDateKey ShipDateKey DeliveryDateKey
1000 20120101 20120102 20120105
1001 20120101 20120102 00000000
1002 20120102 00000000 00000000
Data Warehouse Physical Design
Understanding DW Components Activity
The data warehouse serves four kinds of activity: ETL, data model processing,
reports, and user queries, all running against large fact tables with star
joins to dimension tables.
ETL Loads
• Bulk inserts
• Some lookups and updates
Data Model Processing
• Mostly table/index scans
Report Processing
• Predictable queries
• Many rows with range-based query filters
Self-Service BI
• Potentially unpredictable queries
• Create files with an initial size
• Based on the eventual size of the objects that will be stored in them
• This pre-allocates sequential disk blocks and helps avoid fragmentation.
• Disable autogrowth
• If you begin to run out of space in a data file, it is more efficient to explicitly
increase the file size by a large amount rather than rely on incremental
autogrowth.
Data files guidelines
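A minimal sketch of these guidelines, assuming a database named DW (paths and
sizes are illustrative):

CREATE DATABASE DW
ON PRIMARY
( NAME = DW_data,
  FILENAME = 'D:\Data\DW.mdf',
  SIZE = 100GB,      -- pre-allocate based on eventual object size
  FILEGROWTH = 0 )   -- disable autogrowth; grow explicitly when needed
LOG ON
( NAME = DW_log,
  FILENAME = 'L:\Log\DW.ldf',
  SIZE = 10GB,
  FILEGROWTH = 0 );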
• Create at least one filegroup in addition to the primary one, and
then set it as the default filegroup so you can separate data tables
from system tables.
• Create dedicated filegroups for extremely large fact tables, and use them to
place those fact tables on their own logical disks.
• If some tables in the data warehouse are loaded on a different
schedule from others, consider using filegroups to separate the
tables into groups that can be backed up independently.
• If you intend to partition a large fact table, create a filegroup for
each partition so that older, stable partitions can be backed up and then set
as read-only.
Filegroups guidelines
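A sketch of adding a default data filegroup and a dedicated filegroup for a
large fact table (filegroup names and paths are assumptions):

ALTER DATABASE DW ADD FILEGROUP DataFG;
ALTER DATABASE DW ADD FILE
( NAME = DW_data1, FILENAME = 'D:\Data\DW_data1.ndf', SIZE = 50GB, FILEGROWTH = 0 )
TO FILEGROUP DataFG;
ALTER DATABASE DW MODIFY FILEGROUP DataFG DEFAULT;  -- keep data tables off PRIMARY

ALTER DATABASE DW ADD FILEGROUP FactSalesFG;  -- dedicated filegroup for a large fact table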
• If you use a separate staging database:
• Create it on a logical disk distinct from the data warehouse files.
• If you stage data in tables within the data warehouse database:
• Create a file and filegroup for them on a logical disk separate from the
fact and dimension tables.
• An exception to the previous guideline is made for staging tables that will be
switched with partitions to perform fast loads.
• These must be created on the same filegroup as the partition with which they will be
switched.
Staging tables
• To avoid fragmentation of data files
• Place it on a dedicated logical disk
• Set its initial size based on how much it is likely to be used.
• Set the growth increment to be quite large to ensure that
performance is not interrupted by frequent growth of
TempDB.
• Create multiple files for TempDB to help minimize
contention during page free space (PFS) scans as temporary
objects are created and dropped.
TempDB
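A sketch of adding TempDB data files (file count, sizes, and paths are
assumptions; one file per CPU core, up to eight, is a common starting point):

ALTER DATABASE tempdb
ADD FILE ( NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 1GB );
ALTER DATABASE tempdb
ADD FILE ( NAME = tempdev3, FILENAME = 'T:\TempDB\tempdb3.ndf', SIZE = 8GB, FILEGROWTH = 1GB );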
• Set the recovery model of the data warehouse, staging
database, and TempDB to Simple
• Helps to avoid having to truncate transaction logs
• Additionally, most of the inserts in a data warehouse are
typically performed as bulk load operations, which are
minimally logged.
• To avoid disk resource conflicts between data warehouse I/O
and logging, place the transaction log files for all databases
on a dedicated logical disk.
Transaction logs
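Setting the recovery model as described above (database names assumed; TempDB
always uses the simple recovery model):

ALTER DATABASE DW SET RECOVERY SIMPLE;
ALTER DATABASE Staging SET RECOVERY SIMPLE;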
• SQL Server Enterprise edition supports data compression at
both page and row level.
• Data compression benefits in a data warehouse
• Reduced storage requirements.
• Improved query performance
• Best practices for data compression in a data warehouse
• Use page compression on all dimension tables and fact table partitions.
• If performance is CPU-bound, revert to row compression on frequently-
accessed partitions.
Data Compression
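A sketch of applying these practices (table names assumed):

-- Page compression on a dimension table.
ALTER TABLE dbo.DimProduct REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Page compression on a single fact table partition; revert to ROW if CPU-bound.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = PAGE);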
• Improved query performance
• More granular manageability
• Improved data load performance
• Best practices for partitioning in a DW
• Partition Large Fact Tables
• Partition on an incrementing date key
• Design the partition scheme for ETL and manageability.
• Maintain an empty partition at the start and end of the table
Table Partitioning
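A minimal sketch of a fact table partitioned on an incrementing date key, with
empty partitions at each end (boundary values and the filegroup name are
assumptions):

CREATE PARTITION FUNCTION pfOrderDate (int)
AS RANGE RIGHT FOR VALUES (20130101, 20140101, 20150101); -- leftmost and rightmost partitions stay empty

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([FactSalesFG]);

CREATE TABLE dbo.FactSales
( OrderDateKey int NOT NULL,
  ProductKey   int NOT NULL,
  CustomerKey  int NOT NULL,
  SalesAmount  money NOT NULL )
ON psOrderDate (OrderDateKey);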
• Indexes maximize query performance
• Planning indexes is one of the most important parts of the database
design process
• Some inexperienced BI professionals are tempted to create
many indexes on all tables to support queries, but every index adds
storage and data load overhead, so indexes should be chosen selectively.
Indexes in DW
• Create a clustered index on the surrogate key column.
• This column is used to join the dimension table to fact tables, and a clustered
index will help the query optimizer minimize the number of reads required to filter
fact rows.
• Create a non-clustered index on the alternate key column and
include the SCD current flag, start date, and end date columns.
• This index will improve the performance of lookup operations during ETL data
loads that need to handle slowly-changing dimensions.
• Create non-clustered indexes on frequently searched attributes, and
consider including all members of a hierarchy in a single index.
Dimension table indexes
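A sketch of these dimension indexes, using the customer dimension from the
SCD examples above:

-- Clustered index on the surrogate key.
CREATE CLUSTERED INDEX CIX_DimCustomer ON dbo.DimCustomer (CustKey);

-- Nonclustered index on the business key, covering the SCD columns for ETL lookups.
CREATE NONCLUSTERED INDEX IX_DimCustomer_AltKey
ON dbo.DimCustomer (CustAltKey)
INCLUDE ([Current], [Start], [End]);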
• Create a clustered index on the most commonly-searched
date key.
• Date ranges are the most common filtering criteria in most data warehouse
workloads, so a clustered index on this key should be particularly effective in
improving overall query performance.
• Create non-clustered indexes on other, frequently-searched
dimension keys.
• Consider a columnstore index spanning all columns
Fact table indexes
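A sketch for the FactSales table assumed earlier (the nonclustered columnstore
syntax matches the SQL Server 2012/2014 era of this deck):

-- Clustered index on the most commonly searched date key.
CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (OrderDateKey);

-- Nonclustered columnstore index spanning all columns.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, ProductKey, CustomerKey, SalesAmount);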
• Create a view for each dimension and fact table, with the NOLOCK query hint
in the view definition
• Create views with user-friendly view and column names
• Do not include metadata columns in views
• Create views to combine snowflake dimension tables
• Partition-align indexed views
• Use the SCHEMABINDING option
• Use views as a security layer over the underlying tables
Using Views in a DW
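A sketch combining several of these guidelines: a user-friendly view over a
dimension table with the NOLOCK hint, excluding metadata columns (names assumed):

CREATE VIEW dbo.Customers
AS
SELECT CustKey    AS CustomerKey,
       CustAltKey AS CustomerID,
       Name       AS CustomerName,
       Country, State, City
FROM dbo.DimCustomer WITH (NOLOCK); -- metadata columns (Current, Start, End) excluded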
• Overview of Data Warehousing
• Data Warehouse Solution
• Data Warehouse Infrastructure
• Data Warehouse Hardware
• Data Warehouse Design Overview
• Designing Dimension Tables
• Designing Fact Tables
• Data Warehouse Physical Design
Summary
Thank you
SELECT
KNOWLEDGE
FROM
SQL SERVER
http://www.sqlschool.gr
Copyright © 2015 SQL School Greece
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Building Data Warehouse in SQL Server

  • 1. 25th SQL Night: Building Data Warehouse in SQL Server. Antonios Chatzipavlis, SQLschool.gr Founder, Principal Consultant, SQL Server Evangelist, MVP on SQL Server. May 30, 2015
  • 2. Antonios Chatzipavlis. Database Architect, SQL Server Evangelist. MCT, MCSE, MCITP, MCPD, MCSD, MCDBA, MCSA, MCTS, MCAD, MCP, OCA, ITIL-F. Timeline (1982, 1988, 1996, 1998, 2010, 2012, 2013): I started with computers; I began my professional career in the computer industry; I started working with SQL Server version 6.0; I earned my first Microsoft certification as a Microsoft Certified Solution Developer (3rd in Greece) and began my career as a Microsoft Certified Trainer (MCT), with more than 20,000 hours of training to date; I became a Microsoft MVP on SQL Server for the first time; I created SQL School Greece (www.sqlschool.gr); I became an MCT Regional Lead in the Microsoft Learning program; I was certified as MCSE: Data Platform and MCSE: Business Intelligence.
  • 3. Follow us on social media: Twitter @antoniosch / @sqlschool; Facebook fb/sqlschoolgr; YouTube yt/user/achatzipavlis; LinkedIn SQL School Greece group; Pinterest pi/SQLschool/
  • 5. Stay Involved! • Sign up for a free membership today at sqlpass.org • Linked In: http://www.sqlpass.org/linkedin • Facebook: http://www.sqlpass.org/facebook • Twitter: @SQLPASS • PASS: http://www.sqlpass.org
  • 6. Whatever your data passion – there’s a Virtual Chapter for you! www.sqlpass.org/vc
  • 7. Planning on attending PASS Summit 2015? Start saving today! • The world's largest gathering of SQL Server & BI professionals • Take your SQL Server skills to the next level by learning from the world's top SQL Server experts, in over 190 technical sessions • Over 5,000 registrations, representing 2,000 companies, from 52 countries, ready to network & learn • Registration is $1,795 until July 12th, 2015; save $150 right now using discount code LC15CPJ8
  • 8. Don't miss your chance to vote in the 2015 PASS elections: update your myPASS profile by June 1! In order to vote for the 2015 PASS Nomination Committee & the Board of Directors, you need to complete all mandatory fields in your myPASS profile by 11:59 PM PDT June 1, 2015. • PASS members will be reminded to review & complete their profiles • Members will receive instructions for updating profiles and deleting duplicate profiles • Eligible voters will receive information about key election dates and the voting process after June 1. Head to sqlpass.org/myPASS today! For more info on elections, visit sqlpass.org/elections
  • 9. • Overview of Data Warehousing • Data Warehouse Solution • Data Warehouse Infrastructure • Data Warehouse Hardware • Data Warehouse Design Overview • Designing Dimension Tables • Designing Fact Tables • Data Warehouse Physical Design Agenda
  • 10. Overview of Data Warehousing
  • 11. • There are many definitions for the term “data warehouse,” and disagreements over specific implementation details. • It is generally agreed that a data warehouse is a centralized store of business data that can be used for reporting and analysis to inform key decisions. • A data warehouse provides a solution to the problem of distributed data that prevents effective business decision- making. What is a Data Warehouse?
  • 12. The single organizational repository of enterprise wide data across many or all lines of business and subject areas. Contains massive and integrated data. Represents the complete organizational view of information needed to run and understand the business. Definition of Data Warehouse
  • 13. • Contains a large volume of data that relates to historical business transactions. • Is optimized for read operations that support querying the data. • Is loaded with new or updated data at regular intervals. • Provides the basis for enterprise BI applications. Data Warehouse characteristics
  • 14. What makes a Data Warehouse useful? • Finding the information required for business decisions is time-consuming and error-prone. • Key business data is distributed across multiple systems, which makes it hard to collate all the information necessary for a particular business decision. • Fundamental business questions are hard to answer: most business decisions require knowledge of fundamental facts, but the distribution of data throughout multiple systems in a typical organization can make those questions difficult, or even impossible, to answer.
  • 15. What is a Data Mart? • A specific, subject-oriented, or departmental view of information from the organization. • Generally, data marts are built to satisfy user requirements for information.
  • 16. Data Warehouse vs. Data Mart
  Scope: a data warehouse is application independent, centralized or enterprise-wide, and planned; a data mart serves a specific application, is decentralized by group, and is organic but may be planned.
  Data: a data warehouse holds historical, detailed, and summary data with some denormalization; a data mart holds some history, detailed, and summary data with high denormalization.
  Subjects: a data warehouse covers multiple subjects; a data mart covers a single central subject area.
  Sources: a data warehouse draws on many internal and external sources; a data mart draws on few.
  Other: a data warehouse is flexible, data oriented, long-lived, and a single complex structure; a data mart is restrictive, project oriented, short-lived, and one of multiple simple structures.
  • 17. Data Warehouse Architectures • Centralized Data Warehouse • Departmental Data Mart • Hub and Spoke
  • 21. Components of a Data Warehousing Solution: Data Sources → ETL → Data Warehouse → Data Models → Reporting and Analysis, with Data Cleansing and Master Data Management supporting the warehouse.
  • 22. 1. Start by identifying the business questions that the data warehousing solution must answer 2. Determine the data that is required to answer these questions 3. Identify data sources for the required data 4. Assess the value of each question to key business objectives versus the feasibility of answering it from the available data For large enterprise-level projects, an incremental approach can be effective: • Break the project down into multiple sub-projects • Each sub-project deals with a particular subject area in the data warehouse Starting a Data Warehouse Project
  • 23. SQL Server as a Data Warehousing Platform • Core Data Warehousing: SQL Server Database Engine, SQL Server Integration Services, SQL Server Master Data Services, SQL Server Data Quality Services • Enterprise BI: SQL Server Analysis Services, SQL Server Reporting Services, Microsoft SharePoint Server, Microsoft Office • Self-Service BI and Big Data Analysis: Excel add-ins (PowerPivot, Power Query, Power View, Power Map), Microsoft Office 365 Power BI, Windows Azure HDInsight
  • 25. A data warehouse is a relational database that is optimized for reading data for analysis and reporting. Keep in mind
  • 26. Logical and Physical Database Schema • Logical: typically designed to denormalize data into a structure that minimizes the number of join operations required by the queries used to retrieve and aggregate data; a common approach is to design a star schema. • Physical: physical design choices affect the performance and manageability of the data warehouse.
  • 27. • Query processing requirements, including anticipated peak memory and CPU utilization. • Storage volume and disk input/output requirements. • Network connectivity and bandwidth. • Component redundancy for high availability. Hardware selection
  • 28. • Failover time requirements. • Configuration and management complexity. • The volume of data in the data warehouse. • The frequency of changes to data in the data warehouse. • The effect of the backup process on data warehouse performance. • The time to recover the database in the event of a failure. High availability and Disaster Recovery
  • 29. • The authentication mechanisms that you must support to provide access to the data warehouse. • The permissions that the various users who access the data warehouse will require. • The connections over which data is accessed. • The physical security of the database and backup media. Security
  • 30. • Data Source Connection Types • Credentials and Permissions • Data Formats • Data Acquisition Windows Data sources
  • 31. • Staging: • What data must be staged? • Staging data format • Required transformations: • Transformations during extraction versus data flow transformations • Incremental ETL: • Identifying data changes for extraction • Inserting or updating when loading ETL Processes
  • 32. • Data quality: • Cleansing data: • Validating data values • Ensuring data consistency • Identifying missing values • Deduplicating data • Master data management: • Ensuring consistent business entity definitions across multiple systems • Applying business rules to ensure data validity Data Quality and Master Data Management
  • 34. Data volume • The amount of data that the data warehouse must store • The size and frequency of incremental loads of new data. • The primary consideration is the number of rows in fact tables • But don’t forget dimension data, indexes, and data models that are stored on disk. System Sizing Factors
  • 35. Analysis and Reporting Complexity • This includes the number, complexity, and predictability of the queries that will be used to analyze the data or produce reports. • Typically, BI solutions must support a mix of the following query types: • Simple. Relatively straightforward SELECT statements. • Medium. Repeatedly executed queries that include aggregations or many joins. • Complex. Unpredictable queries with complex aggregations, joins, and calculations. System Sizing Factors
  • 36. Number of Users • This is the total number of information workers who will access the system, and how many of them will do so concurrently. Availability Requirements • These include when the system will need to be used, and what planned or unplanned downtime the business can tolerate. System Sizing Factors
  • 37. Typical System Categorization
  Small: data volume of 100s of GBs to 1 TB; over 50% simple queries, 30% medium, less than 10% complex; 100 total users, 10 to 20 concurrent; availability during business hours.
  Medium: 1 to 10 TB; 50% simple, 30-35% medium, 10-15% complex; 1,000 total users, 100 to 200 concurrent; 1 hour of downtime per night.
  Large: 10 TB to 100s of TBs; 30-35% simple, 40% medium, 20-25% complex; 1,000s of concurrent users; 24/7 operations.
  • 38. Data Warehouse Workloads
  ETL: control flow tasks; data query and insert; network data transfer; in-memory data pipeline; SSIS Catalog or MSDB I/O.
  Cubes: processing; aggregation storage; multidimensional on disk; tabular in memory; query execution.
  Reporting: client requests; data source queries; report rendering; caching; snapshot execution; subscription processing; Report Server Catalog I/O.
  Operations and Maintenance: OS activity; logging; SQL Server Agent jobs; SSIS packages; indexes; backups.
  • 39. Typical Server Topologies for a BI Solution: options range from a single-server architecture (few servers) to a distributed architecture (many servers). Moving toward the distributed end increases hardware costs, software license costs, and configuration complexity, but also scalability, performance, and flexibility.
  • 40. Scaling Out a BI Solution
  Data Warehouse: partition the data across multiple database servers, or use SQL Server Parallel Data Warehouse edition.
  Integration Services: use multiple SSIS servers to perform subsets of the ETL processes in parallel.
  Analysis Services: create a read-only copy of a multidimensional database and connect to it from multiple Analysis Services query servers.
  Reporting Services: install the Reporting Services database on a single database server, then install the Reporting Services report server service on multiple servers that all connect to the same Reporting Services database.
  • 41. Planning for High Availability
  Data Warehouse: AlwaysOn Failover Cluster; RAID storage.
  Analysis Services: AlwaysOn Failover Cluster.
  Integration Services: AlwaysOn Availability Group; AlwaysOn Failover Cluster.
  Reporting Services: AlwaysOn Availability Group; NLB report servers.
  • 43. • A DW usually has longer-running queries • A DW has higher read activity than write activity • The data in DW is usually more static • In a DW it is much more important to be able to process a large amount of data quickly, than it is to support a high number of I/O operations per second Keep in mind
  • 44. Determining Storage Requirements • Determine initial data volume: number of fact table rows x row size (use 100 bytes per row as an estimate if unknown), then add 30-40% for dimensions and indexes • Project data growth: number of new fact rows per month • Factor in compression: typically 3:1 • Other storage requirements: configuration databases, log files, TempDB, staging tables, backups, Analysis Services models. A worked example follows.
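  For example, the estimate can be scripted in T-SQL (the 500 million row count below is purely illustrative):

  DECLARE @FactRows bigint = 500000000;   -- assumed fact row count
  DECLARE @BytesPerRow int = 100;         -- default estimate when the row size is unknown
  DECLARE @RawGB decimal(10,2) = (@FactRows * @BytesPerRow) / 1073741824.0;
  SELECT @RawGB              AS RawFactGB,            -- ~46.6 GB
         @RawGB * 1.35       AS WithDimsAndIndexesGB, -- +35% for dimensions and indexes
         @RawGB * 1.35 / 3.0 AS AfterCompressionGB;   -- assuming the typical 3:1 ratio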
  • 45. Considerations for Storage Hardware • Use a larger number of smaller disks rather than a few large disks • Use the fastest disks you can afford • Consider solid state disks, especially for random I/O • Use RAID 10, or at minimum RAID 5 • Consider a dedicated storage area network for manageability and extensibility • Balance I/O across enclosures, storage processors, and disk groups
  • 46. Server Memory (minimum / maximum): 1 socket: 64 GB / 128 GB; 2 sockets: 128 GB / 256 GB; 4 sockets: 256 GB / 512 GB; 8 sockets: 512 GB / 768 GB.
  • 47. Estimating CPU Requirements • Determine the per-core MCR • Apply the following formula to estimate the required number of cores:
  CPUs = ( [Average query size in MB] / MCR ) x ( [Concurrent users] / [Target response time in seconds] )
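  For example, with a per-core MCR of 200 MB/s, an average query size of 4,000 MB, 20 concurrent users, and a target response time of 60 seconds (all illustrative numbers): CPUs = (4,000 / 200) x (20 / 60) ≈ 6.7, so you would provision at least 8 cores.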
  • 48. • This metric measures the maximum SQL Server data processing rate for a standard query and data set for a specific server and CPU combination. • This is provided as a per-core rate, and it is measured as a query-based scan from memory cache. • MCR is the initial starting point for Fast Track system design. • It represents an estimated maximum required I/O bandwidth for the server, CPU, and workload. • MCR is useful as an initial design guide because it requires only minimal local storage and database schema to estimate potential throughput for a given CPU. • It is not a measure of system performance. Maximum Consumption Rate (MCR)
  • 49. Calculate MCR • Create a reference dataset based on the TPC-H lineitem table or a similar data set. • The table should be sized so that it can be entirely cached in the SQL Server buffer pool while still maintaining a minimum one-second execution time for the query provided here. • For FTDW the following query is used: SELECT sum([integer field]) FROM [table] WHERE [restrict to appropriate data volume] GROUP BY [col]. • Ensure that Resource Governor settings are at default values. • Ensure that the query is executing from the buffer cache: executing the query once should put the pages into the buffer, and subsequent executions should read fully from the buffer. Validate that there are no physical reads in the query statistics output.
  • 50. • Set STATISTICS IO and STATISTICS TIME to ON to output results. • Run the query multiple times, at MAXDOP = 1. • Record the number of logical reads and CPU time from the statistics output for each query execution. • Calculate the MCR in MB/s using the formula: ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024 • A consistent range of values (+/- 5%) should appear over a minimum of five query executions. • Significant outliers (+/- 20% or more) may indicate configuration issues. The average of at least 5 calculated results is the FTDW MCR. Calculate MCR
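  A minimal sketch of such a measurement run (table, column, and predicate are placeholders for your reference dataset):

  SET STATISTICS IO ON;
  SET STATISTICS TIME ON;

  -- Run several times; the first execution warms the buffer cache,
  -- subsequent ones should show zero physical reads.
  SELECT SUM(Quantity)              -- the [integer field]
  FROM dbo.LineItem                 -- the reference [table]
  WHERE ShipDateKey < 19980101      -- restrict to an appropriate data volume
  GROUP BY ReturnFlag
  OPTION (MAXDOP 1);

  -- From the statistics output of each run:
  -- MCR (MB/s) = ( [logical reads] / [CPU time in seconds] ) * 8 / 1024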
  • 51. SQL Server Parallel Data Warehouse • Special SQL Server edition available only in hardware appliances • Massively parallel processing • Shared-nothing architecture • Dedicated control nodes, compute nodes, and storage nodes. Appliance layout: a control node and cluster management servers; database servers (compute nodes) connected over InfiniBand, with dual Fibre Channel to the storage arrays; a landing zone (ETL interface); and backup nodes.
  • 53. The Dimensional Model: a central fact table holding measures, surrounded by dimension tables holding attributes. In a star schema each dimension joins directly to the fact table; in a snowflake schema some dimensions are normalized into multiple related tables.
  • 54. • Identify the grain • Select the required dimensions • Identify the facts Dimensional Modeling
  • 55. Identify the grain • The grain of a dimensional model is the lowest level of detail at which you can aggregate the measures. • It is important to choose a grain that supports the most granular reporting and analytical requirements; typically, the lowest level available in the source data is the best option.
  • 56. • Determine which of the dimensions related to the business process should be included in the model • The selection of dimensions depends on the reporting and analytical requirements, specifically on the business entities by which users need to aggregate the measures • Almost all dimensional models include a time-based dimension Select the required dimensions
  • 57. • Identify the facts that you want to include as measures. • Measures are numeric values that can be expressed at the level of the grain chosen earlier and aggregated across the selected dimensions. • Depending on the grain you choose for the dimensional model and the grain of the source data, you might need to allocate measures from a higher level of grain across multiple fact rows. Identify the facts
  • 58. Documenting Dimensional Models. Example: a Sales Order Item fact table with measures Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, and Shipping Cost, and four dimensions: Time (used for both Order Date and Ship Date: Calendar Year, Month, Date and Fiscal Year, Fiscal Quarter, Month, Date), Salesperson (Region, Country, Territory, Manager, Name), Customer (Name, Country, State or Province, City, Age, Marital Status, Gender), and Product (Category, Subcategory, Product Name, Color, Size).
  • 60. Each row in a dimension table represents an instance of a business entity by which the measures in the fact table can be aggregated Keep in mind
  • 61. Dimension keys • A key column uniquely identifies each row in the dimension table. • Usually the dimension data is obtained from a source system in which a key is already assigned; this is the "business key." • It is standard practice to define a new "surrogate key" that uses an integer value to identify each row. • A surrogate key is recommended for the following reasons: the data warehouse might use dimension data from multiple source systems, so it is possible that business keys are not unique; some source systems use non-numeric keys, such as a globally unique identifier (GUID), or natural keys, such as an email address, to uniquely identify data entities, and integer keys are smaller and more efficient to use in joins from fact tables; and the dimension table may need to support "Type 2" slowly changing dimensions.
  • 62. Dimension keys. Example (ProductKey and CustomerKey are surrogate keys; ProductAltKey and CustomerAltKey are business, or alternate, keys):
  Product dimension (ProductKey, ProductAltKey, ProductName, Color, Size): (1, MB1-B-32, MB1 Mountain Bike, Blue, 32); (2, MB1-R-32, MB1 Mountain Bike, Red, 32)
  Customer dimension (CustomerKey, CustomerAltKey, Name): (1, 1002, Amy Alberts); (2, 1005, Neil Black)
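  A minimal sketch of this pattern in T-SQL (table name and column types are illustrative):

  CREATE TABLE dbo.DimProduct
  (
      ProductKey    int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
      ProductAltKey nvarchar(25)  NOT NULL,                 -- business (alternate) key
      ProductName   nvarchar(100) NOT NULL,
      Color         nvarchar(20)  NULL,
      Size          nvarchar(10)  NULL
  );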
  • 63. • Hierarchies • Multiple attributes can be combined to form hierarchies that enable users to drill down into deeper levels of detail. • Business users can view aggregated fact data at each level • Slicers • Attributes do not need to form hierarchies to be useful in analysis and reporting. • Business users can group or filter data based on single-level hierarchies to create analytical sub-groupings of data. • Drill-through detail • Some attributes have little value as slicers or members of a hierarchy. • It can be useful to include entity-specific attributes to facilitate drill-through functionality in reports or analytical applications. Dimension Attributes and Hierarchies
  • 64. Dimension Attributes and Hierarchies. Example (Country > State > City form a hierarchy; Gender is a slicer; Phone is drill-through detail). Rows (CustKey, CustAltKey, Name, Country, State, City, Phone, Gender): (1, 1002, Amy Alberts, Canada, BC, Vancouver, 555 123, F); (2, 1005, Neil Black, USA, CA, Irvine, 555 321, M); (3, 1006, Ye Xu, USA, NY, New York, 555 222, M)
  • 65. Unknown and None • Identify the semantic meaning of NULL: Unknown or None? • Do not assume NULL equality • Use ISNULL( ) to substitute a known member, as sketched below.
  Source (OrderNo, Discount, DiscountType): (1000, 1.20, Bulk Discount); (1001, 0.00, N/A); (1002, 2.00, NULL); (1003, 0.50, Promotion); (1004, 2.50, Other); (1005, 0.00, N/A); (1006, 1.50, NULL)
  Dimension table (DiscKey, DiscAltKey, DiscountType): (-1, Unknown, Unknown); (0, N/A, None); (1, Bulk Discount, Bulk Discount); (2, Promotion, Promotion); (3, Other, Other)
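  A minimal lookup sketch that routes NULL source values to the Unknown member (staging and dimension table names are assumptions):

  SELECT s.OrderNo,
         s.Discount,
         d.DiscKey                         -- resolves to -1 (Unknown) when the source value is NULL
  FROM stg.Orders AS s
  LEFT JOIN dbo.DimDiscountType AS d
      ON d.DiscAltKey = ISNULL(s.DiscountType, 'Unknown');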
  • 66. Slowly Changing Dimensions – Type 1 • The simplest type of SCD to implement. • Attribute values are updated directly in the existing dimension table row and no history is maintained. • Suitable for attributes that are used to provide drill-through details. • Unsuitable for analytical slicers or hierarchy members where historic comparisons must reflect the attribute values as they were at the time of the fact event. Example: the row (CustKey 1, CustAltKey 1002, Amy Alberts, phone 555 123) is simply overwritten with the new phone number 555 222.
  • 67. Slowly Changing Dimensions – Type 2 • These changes involve the creation of a fresh version of the dimension entity in the form of a new row. • Typically, a bit column in the dimension table is used as a flag to indicate which version of the dimension row is the current one. • Additionally, datetime columns are often used to indicate the start and end of the period for which a version of the row was (or is) current. • Maintaining start and end dates makes it easier to assign the appropriate foreign key value to fact rows as they are loaded, so that they are related to the version of the dimension entity that was current at the time the fact occurred. Example: the current row (CustKey 1, Amy Alberts, Vancouver, Current = Yes, Start 1/1/2000) is expired (Current = No, End 1/1/2012) and a new row (CustKey 4, Amy Alberts, Toronto, Current = Yes, Start 1/1/2012) is added. A minimal load sketch follows.
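  A two-step Type 2 load sketch based on the example above (table names and the tracked City column are assumptions; production ETL often uses MERGE or an SSIS lookup pattern instead):

  DECLARE @LoadDate date = CAST(GETDATE() AS date);

  -- Step 1: expire the current version when a tracked attribute has changed
  UPDATE d
  SET d.[Current] = 0, d.[End] = @LoadDate
  FROM dbo.DimCustomer AS d
  JOIN stg.Customer AS s ON s.CustAltKey = d.CustAltKey
  WHERE d.[Current] = 1 AND d.City <> s.City;

  -- Step 2: insert a fresh current version for entities with no current row
  INSERT INTO dbo.DimCustomer (CustAltKey, Name, City, [Current], [Start])
  SELECT s.CustAltKey, s.Name, s.City, 1, @LoadDate
  FROM stg.Customer AS s
  WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                    WHERE d.CustAltKey = s.CustAltKey AND d.[Current] = 1);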
  • 68. Slowly Changing Dimensions – Type 3 • Rarely used. • The previous value (or a complete history of previous values) is maintained in the dimension table row. • This requires modifying the dimension table schema to accommodate new values for each tracked attribute, and can result in a complex dimension table that is difficult to manage. Example: a Cars column (value 0) becomes Prior Cars = 0 and Current Cars = 1.
  • 69. Time Dimension • Surrogate key • Granularity • Range • Attributes and hierarchies • Multiple calendars • Unknown values. Example rows (DateKey, DateAltKey, MonthDay, WeekDay, Day, MonthNo, Month, Year): (00000000, 01-01-1753, NULL, NULL, NULL, NULL, NULL, NULL) for unknown dates; (20130101, 01-01-2013, 1, 3, Tue, 01, Jan, 2013); (20130102, 01-02-2013, 2, 4, Wed, 01, Jan, 2013); (20130103, 01-03-2013, 3, 5, Thu, 01, Jan, 2013); (20130104, 01-04-2013, 4, 6, Fri, 01, Jan, 2013)
  • 70. • Create a Transact-SQL script • Use Microsoft Excel • Use a BI tool to autogenerate a time dimension table Populating a Time Dimension Table
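  A minimal sketch of the Transact-SQL approach, matching the schema on the previous slide (the date range and name formats are illustrative):

  DECLARE @d date = '2013-01-01';
  WHILE @d <= '2015-12-31'
  BEGIN
      INSERT INTO dbo.DimDate (DateKey, DateAltKey, MonthDay, [WeekDay], [Day], MonthNo, [Month], [Year])
      VALUES (CONVERT(int, CONVERT(char(8), @d, 112)), -- e.g. 20130101
              @d,
              DAY(@d),                                 -- day of month
              DATEPART(weekday, @d),                   -- day-of-week number
              LEFT(DATENAME(weekday, @d), 3),          -- e.g. Tue
              MONTH(@d),
              LEFT(DATENAME(month, @d), 3),            -- e.g. Jan
              YEAR(@d));
      SET @d = DATEADD(day, 1, @d);
  END;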
  • 71. Self-Referencing Dimension • A common requirement in a data warehouse is to support dimensions with parent-child hierarchies • Typically, parent-child hierarchies are implemented as self-referencing tables, in which a column in each row is used as a foreign-key reference to a primary-key value in the same table. Example (EmployeeKey, EmployeeAltKey, EmployeeName, ManagerKey): (1, 1000, Kim Abercrombie, NULL); (2, 1001, Kamil Amireh, 1); (3, 1002, Cesar Garcia, 1); (4, 1003, Jeff Hay, 2)
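  Hierarchy levels in such a table can be resolved at query time with a recursive common table expression (a sketch against the table above; the DimEmployee name is assumed):

  WITH OrgChart AS
  (
      -- Anchor: top-level employees with no manager
      SELECT EmployeeKey, EmployeeName, ManagerKey, 0 AS HierarchyLevel
      FROM dbo.DimEmployee
      WHERE ManagerKey IS NULL
      UNION ALL
      -- Recurse: each employee sits one level below his or her manager
      SELECT e.EmployeeKey, e.EmployeeName, e.ManagerKey, o.HierarchyLevel + 1
      FROM dbo.DimEmployee AS e
      JOIN OrgChart AS o ON e.ManagerKey = o.EmployeeKey
  )
  SELECT * FROM OrgChart;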
  • 72. Junk Dimensions • Combine low-cardinality attributes that don't belong in existing dimensions into a junk dimension • This avoids creating many small dimension tables. Example (JunkKey, OutOfStockFlag, FreeShippingFlag, CreditOrDebit): eight rows covering every combination of the two flags and the payment type, from (1, 1, 1, Credit) through (8, 0, 0, Debit).
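  The full set of combinations can be generated with CROSS JOINs (a sketch; the DimJunk table name is assumed and JunkKey is assumed to be an IDENTITY column):

  INSERT INTO dbo.DimJunk (OutOfStockFlag, FreeShippingFlag, CreditOrDebit)
  SELECT o.Flag, f.Flag, c.PayType
  FROM  (VALUES (1), (0)) AS o(Flag)
  CROSS JOIN (VALUES (1), (0)) AS f(Flag)
  CROSS JOIN (VALUES ('Credit'), ('Debit')) AS c(PayType);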
  • 74. Fact Table Columns. Example (OrderDateKey, ProductKey, CustomerKey, OrderNo, Qty, SalesAmount): (20120101, 25, 120, 1000, 1, 350.99); (20120101, 99, 120, 1000, 2, 6.98); (20120101, 25, 178, 1001, 2, 701.98). OrderDateKey, ProductKey, and CustomerKey are dimension keys; Qty and SalesAmount are measures; OrderNo is a degenerate dimension.
  • 75. Types of Measure
  Additive measures can be summed across all dimensions, e.g. SalesAmount in (OrderDateKey, ProductKey, CustomerKey, SalesAmount): (20120101, 25, 120, 350.99); (20120101, 99, 120, 6.98); (20120102, 25, 178, 701.98).
  Semi-additive measures can be summed across some dimensions but not others, typically not time, e.g. StockCount in (DateKey, ProductKey, StockCount): (20120101, 25, 23); (20120101, 99, 118); (20120102, 25, 22).
  Non-additive measures cannot be meaningfully summed at all, e.g. ProfitMargin in (OrderDateKey, ProductKey, CustomerKey, ProfitMargin): (20120101, 25, 120, 25); (20120101, 99, 120, 22); (20120102, 25, 178, 27).
  • 76. Types of Fact Table
  Transaction fact tables record one row per event, e.g. (OrderDateKey, ProductKey, CustomerKey, OrderNo, Qty, Cost, SalesAmount): (20120101, 25, 120, 1000, 1, 125.00, 350.99); (20120101, 99, 120, 1000, 2, 2.50, 6.98); (20120101, 25, 178, 1001, 2, 250.00, 701.98).
  Periodic snapshot fact tables record one row per entity per period, e.g. (DateKey, ProductKey, OpeningStock, UnitsIn, UnitsOut, ClosingStock): (20120101, 25, 25, 1, 3, 23); (20120101, 99, 120, 0, 2, 118).
  Accumulating snapshot fact tables record one row per process instance, updated as milestones occur, e.g. (OrderNo, OrderDateKey, ShipDateKey, DeliveryDateKey): (1000, 20120101, 20120102, 20120105); (1001, 20120101, 20120102, 00000000); (1002, 20120102, 00000000, 00000000), where 00000000 marks a milestone not yet reached.
  • 78. Understanding DW Components: typical activity by component (ETL, data models, reports, user queries).
  ETL loads: bulk inserts; some lookups and updates; large fact tables; star joins to dimension tables.
  Data model processing: mostly table/index scans.
  Report processing: predictable queries; many rows with range-based query filters.
  Self-service BI: potentially unpredictable queries.
  • 79. Data files guidelines • Create files with an initial size based on the eventual size of the objects that will be stored in them; this pre-allocates sequential disk blocks and helps avoid fragmentation. • Disable autogrowth: if you begin to run out of space in a data file, it is more efficient to explicitly increase the file size by a large amount than to rely on incremental autogrowth.
  • 80. Filegroups guidelines • Create at least one filegroup in addition to the primary one, and set it as the default filegroup so you can separate data tables from system tables. • Create dedicated filegroups for extremely large fact tables and use them to place those fact tables on their own logical disks. • If some tables in the data warehouse are loaded on a different schedule from others, consider using filegroups to separate the tables into groups that can be backed up independently. • If you intend to partition a large fact table, create a filegroup for each partition so that older, stable rows can be backed up and then set as read-only. A sketch follows.
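  A minimal sketch combining the data file and filegroup guidance (the DW database name, paths, and sizes are illustrative):

  ALTER DATABASE DW ADD FILEGROUP DATA;
  ALTER DATABASE DW ADD FILE
      (NAME = 'DW_Data', FILENAME = 'D:\Data\DW_Data.ndf',
       SIZE = 50GB, FILEGROWTH = 0)                -- pre-sized, autogrowth disabled
  TO FILEGROUP DATA;
  ALTER DATABASE DW MODIFY FILEGROUP DATA DEFAULT; -- keeps user tables off PRIMARY

  ALTER DATABASE DW ADD FILEGROUP FACT_SALES;      -- dedicated filegroup for a large fact table
  ALTER DATABASE DW ADD FILE
      (NAME = 'DW_FactSales', FILENAME = 'E:\Data\DW_FactSales.ndf',
       SIZE = 200GB, FILEGROWTH = 0)
  TO FILEGROUP FACT_SALES;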
  • 81. Staging tables • For a separate staging database, create it on a logical disk distinct from the data warehouse files. • For staging tables inside the data warehouse database, create a file and filegroup for them on a logical disk separate from the fact and dimension tables. • An exception to the previous guideline is made for staging tables that will be switched with partitions to perform fast loads: these must be created on the same filegroup as the partition with which they will be switched.
  • 82. TempDB • To avoid fragmentation of data files, place TempDB on a dedicated logical disk and set its initial size based on how much it is likely to be used. • Set the growth increment to be quite large, to ensure that performance is not interrupted by frequent growth of TempDB. • Consider creating multiple files for TempDB to help minimize contention during page free space (PFS) scans as temporary objects are created and dropped.
  • 83. Transaction logs • Set the recovery model of the data warehouse and the staging database to SIMPLE; this avoids having to manage transaction log truncation (TempDB always uses the SIMPLE recovery model). • Additionally, most of the inserts in a data warehouse are typically performed as bulk load operations, which are minimally logged. • To avoid disk resource conflicts between data warehouse I/O and logging, place the transaction log files for all databases on a dedicated logical disk.
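  For example (database names assumed):

  ALTER DATABASE DW      SET RECOVERY SIMPLE;
  ALTER DATABASE Staging SET RECOVERY SIMPLE;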
  • 84. • SQL Server Enterprise edition supports data compression at both page and row level. • Data compression benefits in a data warehouse • Reduced storage requirements. • Improved query performance • Best practices for data compression in a data warehouse • Use page compression on all dimension tables and fact table partitions. • If performance is CPU-bound, revert to row compression on frequently- accessed partitions. Data Compression
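  A sketch of these compression practices (table names and partition numbers are assumptions):

  -- Page compression on a dimension table and a fact table partition
  ALTER TABLE dbo.DimCustomer REBUILD WITH (DATA_COMPRESSION = PAGE);
  ALTER TABLE dbo.FactSales REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = PAGE);

  -- If performance is CPU-bound, revert a frequently-accessed partition to row compression
  ALTER TABLE dbo.FactSales REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = ROW);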
  • 85. • Improved query performance • More granular manageability • Improved data load performance • Best practices for partitioning in a DW • Partition Large Fact Tables • Partition on an incrementing date key • Design the partition scheme for ETL and manageability. • Maintain an empty partition at the start and end of the table Table Partitioning
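  A minimal sketch of monthly partitioning on an incrementing date key (names, boundary values, and the FACT_SALES filegroup are illustrative):

  CREATE PARTITION FUNCTION pf_OrderDate (int)
      AS RANGE RIGHT FOR VALUES (20150101, 20150201, 20150301, 20150401);

  CREATE PARTITION SCHEME ps_OrderDate
      AS PARTITION pf_OrderDate ALL TO (FACT_SALES);

  CREATE TABLE dbo.FactSales
  (
      OrderDateKey int   NOT NULL,   -- partitioning key
      ProductKey   int   NOT NULL,
      CustomerKey  int   NOT NULL,
      OrderNo      int   NOT NULL,
      Qty          int   NOT NULL,
      SalesAmount  money NOT NULL
  ) ON ps_OrderDate (OrderDateKey);
  -- RANGE RIGHT leaves an empty partition below the first boundary;
  -- split a new boundary ahead of each load to keep an empty partition at the end.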
  • 86. Indexes in DW • Indexes maximize query performance • Planning indexes is the most important part of the database design process • Some inexperienced BI professionals are tempted to create many indexes on all tables to support queries; in a data warehouse, a small set of carefully chosen indexes is usually more effective.
  • 87. • Create a clustered index on the surrogate key column. • This column is used to join the dimension table to fact tables, and a clustered index will help the query optimizer minimize the number of reads required to filter fact rows. • Create a non-clustered index on the alternate key column and include the SCD current flag, start date, and end date columns. • This index will improve the performance of lookup operations during ETL data loads that need to handle slowly-changing dimensions. • Create non-clustered indexes on frequently searched attributes, and consider including all members of a hierarchy in a single index. Dimension table indexes
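  A sketch of this guidance against the customer dimension used earlier (names assumed):

  CREATE CLUSTERED INDEX cix_DimCustomer ON dbo.DimCustomer (CustKey);

  CREATE NONCLUSTERED INDEX ix_DimCustomer_AltKey
      ON dbo.DimCustomer (CustAltKey)
      INCLUDE ([Current], [Start], [End]);          -- speeds up SCD lookups during ETL

  CREATE NONCLUSTERED INDEX ix_DimCustomer_Geography
      ON dbo.DimCustomer (Country, [State], City);  -- whole hierarchy in one index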
  • 88. Fact table indexes • Create a clustered index on the most commonly-searched date key: date ranges are the most common filtering criteria in most data warehouse workloads, so a clustered index on this key should be particularly effective in improving overall query performance. • Create non-clustered indexes on other frequently-searched dimension keys. • Create a columnstore index covering all columns of the fact table.
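  A sketch against the fact table defined above (note that on SQL Server 2012/2014 a nonclustered columnstore index makes the table read-only, so it is typically dropped and recreated, or the partition rebuilt, around each load):

  CREATE CLUSTERED INDEX cix_FactSales ON dbo.FactSales (OrderDateKey);

  CREATE NONCLUSTERED INDEX ix_FactSales_Product  ON dbo.FactSales (ProductKey);
  CREATE NONCLUSTERED INDEX ix_FactSales_Customer ON dbo.FactSales (CustomerKey);

  CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
      ON dbo.FactSales (OrderDateKey, ProductKey, CustomerKey, OrderNo, Qty, SalesAmount);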
  • 89. Using Views in a DW • Create a view for each dimension and fact table, with the NOLOCK query hint in the view definition • Give views user-friendly view and column names • Do not include metadata columns in views • Create views to combine snowflake dimension tables • Partition-align indexed views • Use the SCHEMABINDING option • Consider security: views can limit access to the underlying tables. For example:
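  A minimal sketch of such a view (the rpt schema and friendly column names are assumptions):

  CREATE VIEW rpt.Customer
  WITH SCHEMABINDING
  AS
  SELECT c.CustKey AS CustomerKey,
         c.Name    AS CustomerName,
         c.Country, c.[State], c.City   -- SCD metadata columns deliberately excluded
  FROM dbo.DimCustomer AS c WITH (NOLOCK);

  Note that a view containing a table hint such as NOLOCK cannot itself be indexed, so views intended to become partition-aligned indexed views would omit the hint.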
  • 91. • Overview of Data Warehousing • Data Warehouse Solution • Data Warehouse Infrastructure • Data Warehouse Hardware • Data Warehouse Design Overview • Designing Dimension Tables • Designing Fact Tables • Data Warehouse Physical Design Summary