Partitioning divides a large table and its indexes into smaller parts (partitions), so that maintenance operations can be applied on a partition-by-partition basis rather than on the entire table.
1. Table of Contents
SSIS Partitioning and Best Practices
Sliding window
Parallel Execution Using partition logic
SSIS Best Practices
Benefits of using SSIS Partitioning
Appendix
2. SSIS Partitioning and Best Practices
Date: 27/1/2014
Owner: Vinod kumar kodatham
OBJECT OVERVIEW
Technical Name: SSIS Partitioning and Best Practices.
Description: Partitioning divides a large table and its indexes into smaller parts (partitions), so that maintenance operations can be applied on a partition-by-partition basis rather than on the entire table.
3. SSIS Partitioning and Best Practices
This document covers partitioning and the best practices to be followed while developing SSIS ETL packages, in order to improve the performance of the packages.
Types of Partitions
• Vertical partitioning - some columns are kept in one table and the remaining columns in another table.
• Horizontal partitioning - the table is split into partitions based on row ranges.
Requirements for Table Partition
• Partition Function - logical - defines the boundary points on the value line (RANGE RIGHT or RANGE LEFT).
Syntax:
CREATE PARTITION FUNCTION [partfunc_TinyInt_MOD10](tinyint) AS
RANGE RIGHT FOR VALUES (0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09)
GO
Example: creating a RANGE LEFT partition function on an int column:
CREATE PARTITION FUNCTION myRangePF1 (int) AS RANGE LEFT FOR VALUES (1, 100, 1000);
Example: creating a RANGE RIGHT partition function on an int column:
CREATE PARTITION FUNCTION myRangePF2 (int) AS RANGE RIGHT FOR VALUES (1, 100, 1000);
• Partition Scheme - physical - maps the partitions defined by the function to filegroups.
Syntax:
CREATE PARTITION SCHEME [partscheme_DATA_TinyInt_MOD10]
AS PARTITION [partfunc_TinyInt_MOD10] TO ([DATA], [DATA], [DATA],
[DATA], [DATA], [DATA], [DATA], [DATA], [DATA], [DATA])
GO
• Partitioned Key
A single column, or a computed column that is marked PERSISTED.
All data types that are valid as index columns can be used, except timestamp; LOB data types and CLR user-defined types cannot be used.
On a clustered table, the partitioning key must be part of either the primary key or the clustered index.
Ideally, queries should use the partitioning key as a filter.
Partitioning Usage in Table
Create the table with a PARTITION SCHEME:
CREATE TABLE [tmp].[Table_1](
...
) ON [partscheme_DATA_TinyInt_MOD10]([MOD10])
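A complete version of the elided CREATE TABLE above might look like the following sketch. Only the scheme name and the [MOD10] key column come from this document; the other columns and the [SalesID] % 10 computation are illustrative assumptions.

```sql
-- Hypothetical complete version of the partitioned table; only the scheme
-- name and the [MOD10] key column come from the document.
CREATE TABLE [tmp].[Table_1](
    [SalesID]  bigint   NOT NULL,
    [SaleDate] datetime NOT NULL,
    [Amount]   money    NOT NULL,
    -- Computed partitioning key; must be marked PERSISTED to be usable
    [MOD10] AS CAST([SalesID] % 10 AS tinyint) PERSISTED NOT NULL,
    -- On a clustered table the partitioning key must be part of the
    -- primary key or the clustered index
    CONSTRAINT [PK_Table_1] PRIMARY KEY CLUSTERED ([SalesID], [MOD10])
) ON [partscheme_DATA_TinyInt_MOD10]([MOD10])
GO
```

With this layout, a query that filters on [MOD10] (or on [SalesID], from which it is computed) can be eliminated down to a single partition.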
Sliding window
1. Create a non-partitioned archive table with the same structure and a matching clustered index (if required). Place it on the same filegroup as the oldest partition.
2. Use SWITCH to move the oldest partition from the partitioned table to the archive table.
3. Remove the boundary value of the oldest partition using MERGE: get the smallest range value from sys.partition_range_values and MERGE it.
Syntax:
ALTER PARTITION FUNCTION pf_k_rows()
MERGE RANGE (@merge_range)
4. Designate the NEXT USED filegroup.
5. Create a new boundary value for the new partition using SPLIT (the best practice is to split an empty partition at the leading end of the table into two empty partitions, to minimize data movement): get the largest range value from sys.partition_range_values and SPLIT the last range with a new value.
Syntax:
SELECT @split_range = @split_range + 1000
ALTER PARTITION FUNCTION pf_k_rows()
SPLIT RANGE (@split_range)
6. Create a staging table that has the same structure as the partitioned table on the target filegroup.
7. Populate the staging table.
8. Add indexes.
9. Add a check constraint that matches the constraint of the new partition.
10. Ensure that all indexes are aligned.
11. Switch the newest data into the partitioned table (the staging table is now empty).
12. Update statistics on the partitioned table.
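The SWITCH, MERGE, and SPLIT steps above can be sketched in T-SQL as follows. The function name pf_k_rows and the @merge_range/@split_range pattern come from this document; the table names (FactSales, FactSales_Archive, FactSales_Staging), the scheme name ps_k_rows, and the filegroup are assumptions for illustration.

```sql
-- Hedged sketch of one sliding-window cycle; table, scheme and filegroup
-- names are assumptions, only pf_k_rows comes from the document.
DECLARE @merge_range int, @split_range int;

-- Steps 2-3: switch out the oldest partition, then MERGE its boundary away
ALTER TABLE [dbo].[FactSales] SWITCH PARTITION 1 TO [dbo].[FactSales_Archive];
SELECT @merge_range = MIN(CAST(prv.value AS int))
FROM sys.partition_range_values prv
JOIN sys.partition_functions pf ON pf.function_id = prv.function_id
WHERE pf.name = 'pf_k_rows';
ALTER PARTITION FUNCTION pf_k_rows() MERGE RANGE (@merge_range);

-- Steps 4-5: designate NEXT USED, then SPLIT an empty partition at the leading end
ALTER PARTITION SCHEME ps_k_rows NEXT USED [PRIMARY];
SELECT @split_range = MAX(CAST(prv.value AS int)) + 1000
FROM sys.partition_range_values prv
JOIN sys.partition_functions pf ON pf.function_id = prv.function_id
WHERE pf.name = 'pf_k_rows';
ALTER PARTITION FUNCTION pf_k_rows() SPLIT RANGE (@split_range);

-- Step 11: switch the staged data into the newest partition
ALTER TABLE [dbo].[FactSales_Staging]
    SWITCH TO [dbo].[FactSales] PARTITION $PARTITION.pf_k_rows(@split_range);
```

Because SWITCH, MERGE, and SPLIT on empty partitions are metadata-only operations, the whole cycle moves no data when the steps are followed in this order.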
Parallel Execution Using partition logic
Table data refresh time can be improved using partitioned parallel execution.
1. Create the PARTITION FUNCTION.
2. Create the PARTITION SCHEME.
3. CREATE TABLE [dbo].[syslargevolumelog] to track which partitions have been loaded.
4. Check the log: if loading is not yet complete, continue with the steps below; otherwise go to step 8.
5. Create the target table with the PARTITION SCHEME.
6. Load the TargetTable from the SourceTable using the partition predicate (idcolumn/10 = 1, etc.), one load per partition.
7. Update [syslargevolumelog] to record that data has been loaded for this partition.
8. Create a temporary table with the same structure as the original table.
9. Switch all partitions to the temporary table.
10. Create unique clustered indexes.
11. Rename the temporary table as the original table.
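The per-partition load (steps 3-7 above) can be sketched as below. The [syslargevolumelog] name comes from this document; its columns, the source/target table names, and the use of a modulo predicate (suggested by the MOD10 naming elsewhere in the document, versus the literal idcolumn/10 division shown above) are assumptions.

```sql
-- Hedged sketch of the partition-load tracking table; column names are assumptions.
CREATE TABLE [dbo].[syslargevolumelog](
    [PartitionNo] tinyint NOT NULL PRIMARY KEY,
    [IsLoaded]    bit     NOT NULL DEFAULT (0)
);

-- One such batch runs per partition, in parallel (e.g. one SSIS Execute SQL
-- or Data Flow task per partition); shown here for partition 1:
IF NOT EXISTS (SELECT 1 FROM [dbo].[syslargevolumelog]
               WHERE [PartitionNo] = 1 AND [IsLoaded] = 1)
BEGIN
    INSERT INTO [dbo].[TargetTable] WITH (TABLOCK)
    SELECT *
    FROM [dbo].[SourceTable]
    WHERE [idcolumn] % 10 = 1;        -- partition predicate, one value per task

    -- Record completion so a restarted run skips this partition (step 4)
    UPDATE [dbo].[syslargevolumelog]
    SET [IsLoaded] = 1
    WHERE [PartitionNo] = 1;
END
```

The log table makes the refresh restartable: a failed run resumes with only the partitions whose [IsLoaded] flag is still 0.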
SSIS Best Practices
Avoid SELECT *
Removing unused output columns can increase Data Flow task performance.
Steps to be considered while loading the data:
If any non-clustered index(es) exist, DROP all non-clustered index(es).
If a clustered index exists, DROP the clustered index.
Steps to be considered while selecting the data:
If the clustered index does not exist, CREATE the clustered index.
If the non-clustered index(es) do not exist, CREATE the non-clustered index(es).
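The drop-before-load / recreate-before-select pattern above can be sketched as follows; the table, index, and column names are assumptions for illustration.

```sql
-- Hedged sketch of load-time index handling; all object names are assumptions.
-- Before the bulk load: drop non-clustered indexes first, then the clustered index.
IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE object_id = OBJECT_ID('dbo.TargetTable')
             AND name = 'IX_TargetTable_Col1')
    DROP INDEX [IX_TargetTable_Col1] ON [dbo].[TargetTable];

IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE object_id = OBJECT_ID('dbo.TargetTable')
             AND name = 'CIX_TargetTable' AND type = 1)  -- type 1 = clustered
    DROP INDEX [CIX_TargetTable] ON [dbo].[TargetTable];

-- ... bulk load runs here against the unindexed heap ...

-- Before selecting: recreate the clustered index first, then the non-clustered ones.
CREATE CLUSTERED INDEX [CIX_TargetTable] ON [dbo].[TargetTable]([KeyCol]);
CREATE NONCLUSTERED INDEX [IX_TargetTable_Col1] ON [dbo].[TargetTable]([Col1]);
```

Loading into a heap avoids per-row index maintenance; rebuilding the indexes once afterwards is usually cheaper than maintaining them throughout the load.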
Effect of OLEDB Destination Settings
Keep Identity – By default this setting is unchecked. If you check it, the dataflow engine will ensure that the source identity values are preserved and the same values are inserted into the destination table.
Keep Nulls – By default this setting is unchecked. If you check it, any default constraint on the destination table's column will be ignored and the NULLs of the source column will be preserved and inserted into the destination.
Table Lock – By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time.
Check Constraints – Again, by default this setting is checked, and the recommendation is to uncheck it if you are sure that the incoming data is not going to violate the constraints of the destination table. Unchecking this option will improve the performance of the data load.
Better performance with parallel execution
MaxConcurrentExecutables – the default value is -1, which means the total number of available processors + 2; if you have hyper-threading enabled, it is the total number of logical processors + 2.
Avoid asynchronous transformations (such as the Sort transformation) wherever possible.
Examples of asynchronous transformations:
- Aggregate
- Fuzzy Grouping
- Merge
- Merge Join
- Sort
- Union All
How DelayValidation property can help you
In general, the package is validated at design time itself. However, we can control this behavior by using the DelayValidation property. The default value of this property is false; by setting it to true, we can delay validation of the package until run time.
When to use events logging and when to avoid...
The recommendation here is to enable logging only if required. You can dynamically set the value of the LoggingMode property (of a package and its executables) to enable or disable logging without modifying the package. Also, you should choose to log only for those executables which you suspect to have problems, and further, you should log only those events which are absolutely required for troubleshooting.
Effect of Rows Per Batch and Maximum Insert Commit Size Settings
Rows per batch – The default value for this setting is -1, which specifies that all incoming rows will be treated as a single batch. You can change this default behavior and break all incoming rows into multiple batches; the only allowed value is a positive integer, which specifies the maximum number of rows in a batch.
OLEDB Destination:
Maximum insert commit size – The default value for this setting is 2147483647 (the largest value for a 4-byte integer type), which specifies that all incoming rows will be committed once, on successful completion. You can specify a positive value for this setting to indicate that a commit will be done after that number of records. Changing the default value puts overhead on the dataflow engine, which has to commit several times; but at the same time it releases pressure on the transaction log and tempdb, which can otherwise grow tremendously, specifically during high-volume data transfers.
DefaultBufferMaxSize and DefaultBufferMaxRows
The number of buffers created depends on how many rows fit into a buffer, which in turn depends on a few factors:
1. The estimated row size.
2. The DefaultBufferMaxSize property of the data flow task; its default value is 10 MB, and its upper and lower boundaries are MaxBufferSize (100 MB) and MinBufferSize (64 KB).
3. DefaultBufferMaxRows, again a property of the data flow task, which specifies the default number of rows in a buffer; its default value is 10000.
Lookup transformation consideration
Choose the caching mode wisely after analyzing your environment.
If you are using Partial Caching or No Caching mode, ensure you have an index on the reference table for better
performance.
Instead of directly specifying a reference table in the lookup configuration, you should use a SELECT statement with only the required columns, and a WHERE clause to filter out all the rows which are not required for the lookup.
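Such a trimmed lookup query might look like the following sketch; the table and column names are assumptions, not from the original document.

```sql
-- Hedged sketch of a trimmed Lookup query; object names are assumptions.
-- Instead of pointing the Lookup transformation at the whole reference table:
SELECT [CustomerKey], [CustomerAltKey]   -- only the join and output columns
FROM [dbo].[DimCustomer]
WHERE [IsActive] = 1;                    -- filter rows the lookup never needs
```

Trimming columns shrinks the lookup cache, and the WHERE clause keeps irrelevant rows out of it entirely.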
Set the data type of each column appropriately, especially if your source is a flat file. This will enable you to accommodate as many rows as possible in the buffer.
Avoid many small buffers. Tweak the values for DefaultBufferMaxRows and DefaultBufferMaxSize to get as many records into a buffer as possible, especially when dealing with large data volumes.
Full Load vs Delta Load
Design the package in such a way that it does a full pull of data only in the beginning or on demand; from then onward it should do an incremental pull. This will greatly reduce the volume of data load operations, especially when volumes are likely to increase over the lifecycle of an application. For this purpose, use the CDC (Change Data Capture) feature of SQL Server 2008 where it is enabled upstream; for previous versions of SQL Server, implement your own incremental pull logic.
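For SQL Server versions without CDC, the incremental pull logic can be as simple as a watermark comparison; the table, column, and watermark names below are assumptions for illustration.

```sql
-- Hedged sketch of a watermark-based incremental pull; all names are assumptions.
DECLARE @LastLoaded datetime;
SELECT @LastLoaded = MAX([LoadedThrough]) FROM [etl].[LoadWatermark];

-- Pull only the rows changed since the last successful run
SELECT *
FROM [dbo].[SourceTable]
WHERE [ModifiedDate] > @LastLoaded;
```

After a successful load, the package updates [etl].[LoadWatermark] so the next run starts from the new high-water mark.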
Use merge instead of SCD
The big advantage of the MERGE statement is being able to handle multiple actions in a single pass over the data sets, rather than requiring multiple passes with separate inserts and updates. A well-tuned optimizer can handle this extremely efficiently.
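A single-pass MERGE replacing separate SCD insert/update passes might look like the following sketch; the dimension and staging table names are assumptions.

```sql
-- Hedged sketch of MERGE as an SCD Type 1 replacement; object names are assumptions.
MERGE [dbo].[DimProduct] AS tgt
USING [stg].[Product] AS src
    ON tgt.[ProductCode] = src.[ProductCode]
WHEN MATCHED AND tgt.[ProductName] <> src.[ProductName] THEN
    -- existing row changed: overwrite in place (Type 1 behavior)
    UPDATE SET tgt.[ProductName] = src.[ProductName]
WHEN NOT MATCHED BY TARGET THEN
    -- new row: insert it
    INSERT ([ProductCode], [ProductName])
    VALUES (src.[ProductCode], src.[ProductName]);
```

One scan of the source and target replaces the separate update and insert data flows that the SSIS Slowly Changing Dimension component would generate.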
Set the packet size in the connection to 32767.
Use data types as narrow as possible, for less memory usage.
Do not perform excessive casting.
Use GROUP BY in the source query instead of the Aggregate transformation.
Avoid unnecessary delta detection; compare its cost with a simple reload.
A maximum insert commit size of 0 is the fastest.
Benefits of using SSIS Partitioning
Following are some of the benefits of following SSIS partitioning and best practices:
It facilitates the management of large fact tables in data warehouses.
Performance and parallelism benefits.
Dividing the table across filegroups benefits I/O operations, fetching the latest data, re-indexing, and backup and restore.
Efficient range-based inserts and range-based deletes.
Sliding window scenarios.
In SQL Server 2008 SP2 and SQL Server 2008 R2 SP1, you can choose to enable support for 15,000 partitions.
Appendix
References used for Best Practices:
http://msdn.microsoft.com/en-us/library/ms190787.aspx
http://www.mssqltips.com/sql_server_business_intelligence_tips.asp