Data Warehouse Best Practices
In this presentation we review the Best Practices for Data Warehouse Projects and Data Warehouse Design


Usage Rights

© All Rights Reserved

  • Reader comment: I think you might find that the statistics are bang up to date after rebuilding an index; a rebuild has the effect of updating the statistics on that index with a full scan. Worse still, if you then update the statistics you will not only be repeating work the index rebuild has already done, but, unless the update is done with a full scan, you will replace them with statistics produced at a lower sampling percentage.

Data Warehouse Best Practices: Presentation Transcript

  • DATA WAREHOUSE BEST PRACTICES. Dr. Eduardo Castro, MSc. ecastro@simsasys.com. http://ecastrom.blogspot.com. http://comunidadwindows.org. http://tiny.cc/comwindows. Facebook: ecastrom. Twitter: edocastro
  • SOURCES. This presentation is based on the following sources: Datawarehouse, by Ravi Ranjan; Top 10 Best Practices for Building a Large Scale Relational Data Warehouse, by the SQL CAT team.
  • Complexities of Creating a Data Warehouse • Incomplete data errors • Missing fields • Records or fields that, by design, are not being recorded • Incorrect data errors • Wrong calculations or aggregations • Duplicate records • Wrong information entered into the source system. Source: Datawarehouse, Ravi Ranjan
  • Data Warehouse Pitfalls • You are going to spend much time extracting, cleaning, and loading data • You are going to find problems with the systems feeding the data warehouse • You will find the need to store/validate data not being captured/validated by any existing system • Large scale data warehousing can become an exercise in data homogenizing
  • Data Warehouse Pitfalls… • The time it takes to load the warehouse will expand to fill the available window... and then some • You are building a HIGH maintenance system • You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer
  • Best Practices • Complete requirements and design • Prototyping is key to business understanding • Utilize proper aggregations and detailed data • Training is an on-going process • Build data integrity checks into your system.
  • Top 10 Best Practices for Building a Large Scale Relational Data Warehouse • Building a large scale relational data warehouse is a complex task. • This section describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. • Most large scale data warehouses use table and index partitioning, and therefore many of the recommendations here involve partitioning. • Most of these tips are based on experiences building large data warehouses on SQL Server. Source: Top 10 Best Practices for Building a Large Scale Relational Data Warehouse, SQL CAT
  • Consider partitioning large fact tables • Consider partitioning fact tables that are 50 to 100 GB or larger. • Partitioning can provide manageability and often performance benefits. • Faster, more granular index maintenance. • More flexible backup/restore options. • Faster data loading and deleting. • Faster queries when restricted to a single partition. • Typically partition the fact table on the date key. • Enables sliding window. • Enables partition elimination.
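As a sketch of the slide above, a monthly-partitioned fact table on an integer date key might be declared as follows. All object names, boundary values, and the single-filegroup mapping are illustrative, not from the presentation:

```sql
-- Monthly partition function on an integer yyyymmdd date key. RANGE RIGHT
-- means each boundary value belongs to the partition on its right.
CREATE PARTITION FUNCTION pfMonthly (int)
AS RANGE RIGHT FOR VALUES (20060101, 20060201, 20060301, 20060401);

-- Map every partition to one filegroup for simplicity; production designs
-- often spread partitions across several filegroups.
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Fact table created directly on the partition scheme, keyed by DateKey.
CREATE TABLE dbo.FactSales
(
    DateKey     int   NOT NULL,   -- e.g. 20060215
    ProductKey  int   NOT NULL,
    SalesAmount money NOT NULL
) ON psMonthly (DateKey);
```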
  • Build the clustered index on the date key of the fact table • This supports efficient queries to populate cubes or retrieve a historical data slice. • If you load data in a batch window, create the clustered index on the fact table with the options ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF. • This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates.
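The index creation described above could be sketched like this (table, index, and partition scheme names are hypothetical):

```sql
-- Clustered index on the date key, partition-aligned. Disabling row and
-- page locks means scans take coarse-grained locks instead of acquiring
-- many fine-grained ones during large reads and updates.
CREATE CLUSTERED INDEX cix_FactSales_DateKey
ON dbo.FactSales (DateKey)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)
ON psMonthly (DateKey);
```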
  • Build the clustered index on the date key of the fact table (continued) • Build nonclustered indexes for each foreign key. • This helps 'pinpoint' queries extract rows based on a selective dimension predicate. • Use filegroups for administration requirements such as backup/restore, partial database availability, etc.
  • Choose partition grain carefully • Most customers use month, quarter, or year. • For efficient deletes, you must delete one full partition at a time. • It is faster to load a complete partition at a time. • Daily partitions for daily loads may be an attractive option. • However, keep in mind that a table can have a maximum of 1,000 partitions. • Partition grain affects query parallelism.
  • Choose partition grain carefully • For SQL Server 2005: • Queries touching a single partition can parallelize up to MAXDOP (maximum degree of parallelism). • Queries touching multiple partitions use one thread per partition, up to MAXDOP.
  • Choose partition grain carefully • For SQL Server 2008: • Parallel threads up to MAXDOP are distributed proportionally to scan partitions, and multiple threads per partition may be used even when several partitions must be scanned. • Avoid a partition design where only 2 or 3 partitions are touched by frequent queries if you need MAXDOP parallelism (assuming MAXDOP = 4 or larger).
  • Design dimension tables appropriately • Use integer surrogate keys for all dimensions other than the Date dimension. • Use the smallest possible integer for the dimension surrogate keys; this helps keep the fact table narrow. • Use a meaningful date key of integer type derivable from the DATETIME data type (for example: 20060215). • Don't use a surrogate key for the Date dimension.
  • Design dimension tables appropriately • Build a clustered index on the surrogate key of each dimension table. • Build a nonclustered index on the business key (potentially combined with a row-effective-date) to support surrogate key lookups during loads. • Build nonclustered indexes on other frequently searched dimension columns. • Avoid partitioning dimension tables.
  • Design dimension tables appropriately • Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. • You can create foreign key constraints WITH NOCHECK to document the relationships, but don't enforce them. • Ensure data integrity through Transform Lookups, or perform the data integrity checks at the source of the data.
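A documented-but-unenforced foreign key, as the slide above suggests, might look like this (table and constraint names are hypothetical):

```sql
-- WITH NOCHECK skips validating existing rows when the constraint is added,
-- and disabling the constraint afterwards keeps new loads from being
-- checked either: the relationship is documented, never enforced.
ALTER TABLE dbo.FactSales WITH NOCHECK
ADD CONSTRAINT fk_FactSales_Product
FOREIGN KEY (ProductKey) REFERENCES dbo.DimProduct (ProductKey);

ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT fk_FactSales_Product;
```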
  • Write effective queries for partition elimination • Whenever possible, place a query predicate (WHERE condition) directly on the partitioning key (Date dimension key) of the fact table.
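For example, assuming a fact table partitioned monthly on an integer DateKey column (names are illustrative), a query shaped like this lets the optimizer touch only one partition:

```sql
-- The range predicate sits directly on the partitioning key (DateKey),
-- so the optimizer can eliminate every partition except February 2006.
SELECT SUM(f.SalesAmount) AS FebSales
FROM dbo.FactSales AS f
WHERE f.DateKey >= 20060201
  AND f.DateKey <  20060301;
```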
  • Use the Sliding Window technique to maintain data • Maintain a rolling time window for online access to the fact tables: load the newest data, unload the oldest data. • Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement. • Avoid splitting or merging populated partitions. Splitting or merging populated partitions can be extremely inefficient, as it may cause as much as 4 times more log generation and also cause severe locking.
  • Use the Sliding Window technique to maintain data • Create the load staging table in the same filegroup as the partition you are loading. • Create the unload staging table in the same filegroup as the partition you are deleting. • It is fastest to load a complete new partition at one time, but this is only possible when the partition size equals the data load frequency (for example, you have one partition per day and you load data once per day).
  • Use the Sliding Window technique to maintain data • If the partition size doesn't match the data load frequency, incrementally load the latest partition. • Various options for loading bulk data into a partitioned table are discussed in the whitepaper at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx. • Always unload one partition at a time.
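One monthly turn of the sliding window described in the slides above could be sketched as follows. Object names, boundary values, and partition numbers are assumptions; the partition number to switch out depends on your boundary list:

```sql
-- 1) Unload the oldest month: metadata-only switch into a staging table
--    created with identical structure on the same filegroup.
ALTER TABLE dbo.FactSales SWITCH PARTITION 2 TO dbo.FactSales_Unload;

-- 2) Merge away the now-empty oldest boundary. No data movement occurs
--    because the partition was just emptied.
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE (20060101);

-- 3) Split at the newest end to make room for the next month. Again no
--    data movement, because the range being split is empty.
ALTER PARTITION SCHEME psMonthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfMonthly() SPLIT RANGE (20060501);
```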
  • Efficiently load the initial data • Use the SIMPLE or BULK_LOGGED recovery model during the initial data load. • Create the partitioned fact table with the clustered index. • Create non-indexed staging tables for each partition, and separate source data files for populating each partition. • Populate the staging tables in parallel. • Use multiple BULK INSERT, BCP, or SSIS tasks.
  • Efficiently load the initial data • Create as many load scripts to run in parallel as there are CPUs, if there is no I/O bottleneck. If I/O bandwidth is limited, use fewer scripts in parallel. • Use 0 batch size in the load. • Use 0 commit size in the load. • Use TABLOCK. • Use BULK INSERT if the sources are flat files on the same server; use BCP or SSIS if data is being pushed from remote machines.
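A single staging-table load following the advice above might look like this (the staging table name and file path are hypothetical):

```sql
-- Load one month's flat file into its staging table under a table lock.
-- Omitting BATCHSIZE loads the whole file as a single batch (the "0 batch
-- size" the slide refers to); with the SIMPLE or BULK_LOGGED recovery
-- model and a non-indexed heap target, this qualifies for minimal logging.
BULK INSERT dbo.Stage_200602
FROM 'D:\load\sales_200602.dat'
WITH (TABLOCK);
```

One such statement would run per staging table, in parallel scripts up to the CPU count.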
  • Efficiently load the initial data • Build a clustered index on each staging table, then create appropriate CHECK constraints. • SWITCH all partitions into the partitioned table. • Build nonclustered indexes on the partitioned table. • It is possible to load 1 TB in under an hour on a 64-CPU server with a SAN capable of 14 GB/sec throughput (non-indexed table).
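The CHECK-constraint-then-SWITCH step above could be sketched like this. Names, the date range, and the target partition number are assumptions; the constraint must match the target partition's boundary range exactly for SWITCH to succeed:

```sql
-- A trusted CHECK constraint proves every staged row fits the target
-- partition's range, which SWITCH requires.
ALTER TABLE dbo.Stage_200602 WITH CHECK
ADD CONSTRAINT ck_Stage_200602_DateKey
CHECK (DateKey >= 20060201 AND DateKey < 20060301);

-- Metadata-only move of the staged rows into the fact table.
ALTER TABLE dbo.Stage_200602 SWITCH TO dbo.FactSales PARTITION 3;
```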
  • Efficiently delete old data • Use partition switching whenever possible. • To delete millions of rows from nonpartitioned, indexed tables: • Avoid DELETE FROM ... WHERE ... • Huge locking and logging issues • Long rollback if the delete is canceled • It is usually faster to: • INSERT the records to keep into a non-indexed table • Create index(es) on the table • Rename the new table to replace the original
  • Efficiently delete old data • As an alternative, 'trickle' deletes using the following repeatedly in a loop: DELETE TOP (1000) ... ; COMMIT • Another alternative is to update the rows to mark them as deleted, then delete them later during a non-critical time.
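The trickle-delete loop above could be written out like this (table and column names are illustrative):

```sql
-- Delete in 1,000-row batches so each transaction holds locks briefly
-- and the log can be truncated or reused between batches.
DECLARE @rows int;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000) FROM dbo.FactSales
    WHERE DateKey < 20060101;
    SET @rows = @@ROWCOUNT;
END;
```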
  • Manage statistics manually • Statistics on partitioned tables are maintained for the table as a whole. • Manually update statistics on large fact tables after loading new data. • Manually update statistics after rebuilding the index on a partition. • If you regularly update statistics after periodic loads, you may turn off autostats on that table.
  • Manage statistics manually • This is important for optimizing queries that may need to read only the newest data. • Updating statistics on small dimension tables after incremental loads may also help performance. • Use the FULLSCAN option of UPDATE STATISTICS on dimension tables for more accurate query plans.
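The manual maintenance described in the two slides above could be sketched as follows (table names are hypothetical):

```sql
-- After a periodic load or a partition-level index rebuild:
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;

-- Dimension tables are small, so FULLSCAN is cheap and yields accurate plans:
UPDATE STATISTICS dbo.DimProduct WITH FULLSCAN;

-- Optionally add NORECOMPUTE so autostats does not later re-sample the
-- fact table at a lower percentage than the manual FULLSCAN update:
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN, NORECOMPUTE;
```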
  • Consider efficient backup strategies • Backing up the entire database may take a significant amount of time for a very large database. • For example, backing up a 2 TB database to a 10-spindle RAID-5 disk on a SAN may take 2 hours (at a rate of 275 MB/sec). • Snapshot backup using SAN technology is a very good option. • Reduce the volume of data backed up regularly.
  • Consider efficient backup strategies • The filegroups for the historical partitions can be marked as READ ONLY. • Perform a filegroup backup once, when a filegroup becomes read-only. • Perform regular backups only on the read/write filegroups. • Note that RESTOREs of the read-only filegroups cannot be performed in parallel.
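The filegroup strategy above could be sketched like this. Database, filegroup, and backup-path names are illustrative:

```sql
-- Freeze a historical filegroup, back it up once, then back up only the
-- read/write filegroups on the regular schedule.
ALTER DATABASE DW MODIFY FILEGROUP FG_2006 READ_ONLY;

BACKUP DATABASE DW
FILEGROUP = 'FG_2006'
TO DISK = 'E:\backup\DW_FG_2006.bak';

-- Regular, much smaller backup covering only the active filegroups:
BACKUP DATABASE DW
READ_WRITE_FILEGROUPS
TO DISK = 'E:\backup\DW_readwrite.bak';
```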