Table of Contents
SSIS Partitioning and Best Practices
    Sliding window
    Parallel Execution Using partition logic
SSIS Best Practices
Benefits of using SSIS Partitioning
Appendix
SSIS Partitioning and Best Practices

Date: 27/1/2014
Owner: Vinod kumar kodatham

OBJECT OVERVIEW
Technical Name: SSIS Partitioning and Best Practices
Description: Partitioning divides a large table and its indexes into smaller parts (partitions), so that maintenance operations can be applied partition by partition rather than on the entire table.
SSIS Partitioning and Best Practices
Partitioning techniques and best practices to follow while developing SSIS ETL packages, in order to improve package performance.

Types of Partitions
• Vertical partitioning: some columns are stored in one table and the remaining columns in another table.
• Horizontal partitioning: the table is split by ranges of rows.

Requirements for Table Partition
• Partition Function - logical; defines the boundary points and whether each boundary value belongs to the partition on its left or on its right (RANGE LEFT or RANGE RIGHT).
  Syntax:
      CREATE PARTITION FUNCTION [partfunc_TinyInt_MOD10](tinyint)
      AS RANGE RIGHT FOR VALUES (0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09)
      GO
  Example: creating a RANGE LEFT partition function on an int column
      CREATE PARTITION FUNCTION myRangePF1 (int)
      AS RANGE LEFT FOR VALUES (1, 100, 1000);
  Example: creating a RANGE RIGHT partition function on an int column
      CREATE PARTITION FUNCTION myRangePF2 (int)
      AS RANGE RIGHT FOR VALUES (1, 100, 1000);
• Partition Scheme - maps the partitions of a partition function to filegroups.
  Syntax:
      CREATE PARTITION SCHEME [partscheme_DATA_TinyInt_MOD10]
      AS PARTITION [partfunc_TinyInt_MOD10]
      TO ([DATA], [DATA], [DATA], [DATA], [DATA], [DATA], [DATA], [DATA], [DATA], [DATA])
      GO
• Partitioning Key
  - A single column, or a computed column that is marked PERSISTED.
  - All data types that are valid for index columns can be used, except timestamp. LOB data types and CLR user-defined types cannot be used.
  - Clustered table: the key must be part of either the primary key or the clustered index.
  - Ideally, queries should filter on the partitioning key.

Partitioning Usage in Table
Create the table on the PARTITION SCHEME:
    CREATE TABLE [tmp].[Table_1](
        . .
    ) ON [partscheme_DATA_TinyInt_MOD10]([MOD10])

Sliding window
1. Create a non-partitioned archive table with the same structure and a matching clustered index (if required). Place it on the same filegroup as the oldest partition.
2. Use SWITCH to move the oldest partition from the partitioned table to the archive table.
3. Remove the boundary value of the oldest partition using MERGE. Get the smallest range value from sys.partition_range_values and merge it.
   Syntax:
       ALTER PARTITION FUNCTION pf_k_rows() MERGE RANGE (@merge_range)
4. Designate the NEXT USED filegroup.
5. Create a new boundary value for the new partition using SPLIT (the best practice is to split an empty partition at the leading end of the table into two empty partitions, to minimize data movement). Get the largest range value from sys.partition_range_values and split the last range with a new value.
   Syntax:
       SELECT @split_range = @split_range + 1000
       ALTER PARTITION FUNCTION pf_k_rows() SPLIT RANGE (@split_range)
6. Create a staging table that has the same structure as the partitioned table, on the target filegroup.
7. Populate the staging table.
8. Add indexes.
9. Add a check constraint that matches the constraint of the new partition.
10. Ensure that all indexes are aligned.
11. Switch the newest data into the partitioned table (the staging table is now empty).
12. Update statistics on the partitioned table.

Parallel Execution Using partition logic
Table data refresh time can be improved using partitioned parallel execution.
1. Create the PARTITION FUNCTION.
2. Create the PARTITION SCHEME.
3. Create the table [dbo].[syslargevolumelog].
4. Check whether loading has completed: if not, continue with step 5; otherwise go to step 8.
5. Create the table on the PARTITION SCHEME.
6. Load the TargetTable from the SourceTable using a filter such as idcolumn/10=1, etc.
7. Update [syslargevolumelog] to record that the data for this partition has been loaded.
8. Create a temporary table with the same structure as the original table.
9. Switch all partitions to the temporary table.
10. Create unique clustered indexes.
11. Rename the temporary table to the name of the original table.

SSIS Best Practices

Avoid SELECT *
Removing unused output columns can increase Data Flow task performance.

Steps to consider while loading the data:
• If any non-clustered indexes exist, DROP all non-clustered indexes.
• If a clustered index exists, DROP the clustered index.
Steps to consider while selecting the data:
• If a clustered index does not exist, CREATE the clustered index.
• If the non-clustered indexes do not exist, CREATE the non-clustered indexes.

Effect of OLEDB Destination Settings
Keep Identity – By default this setting is unchecked. If you check it, the data flow engine ensures that the source identity values are preserved and the same values are inserted into the destination table.
Keep Nulls – By default this setting is unchecked. If you check it, the default constraint on the destination table's column is ignored and the NULL from the source column is inserted into the destination.
Table Lock – By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by another process at the same time.
Check Constraints – By default this setting is checked, and the recommendation is to uncheck it if you are sure that the incoming data will not violate the constraints of the destination table. Unchecking this option improves the performance of the data load.

Better performance with parallel execution
MaxConcurrentExecutables – The default value is -1, which means the total number of available processors + 2. If hyper-threading is enabled, it is the total number of logical processors + 2.

Avoid asynchronous transformations (such as the Sort transformation) wherever possible. Examples:
- Aggregate
- Fuzzy Grouping
- Merge
- Merge Join
- Sort
- Union All

How the DelayValidation property can help you
In general, the package is validated at design time. This behavior can be controlled with the DelayValidation property, whose default value is false. Setting DelayValidation to true delays validation of the package until run time.

When to use event logging and when to avoid it
The recommendation is to enable logging only if required. You can dynamically set the value of the LoggingMode property (of a package and its executables) to enable or disable logging without modifying the package. Log only those executables that you suspect have problems, and log only those events that are absolutely required for troubleshooting.

Effect of Rows Per Batch and Maximum Insert Commit Size Settings
Rows per batch – The default value is -1, which specifies that all incoming rows are treated as a single batch. You can change this default behavior and break the incoming rows into multiple batches. The only other allowed value is a positive integer, which specifies the maximum number of rows in a batch.
Maximum insert commit size (OLEDB Destination) – The default value is 2147483647 (the largest value of a 4-byte integer type), which specifies that all incoming rows are committed once, on successful completion. You can specify a positive value to have a commit issued after that number of records. Changing the default adds overhead, because the data flow engine has to commit several times, but it also relieves the pressure on the transaction log and tempdb, which could otherwise grow tremendously, especially during high-volume data transfers.

DefaultBufferMaxSize and DefaultBufferMaxRows
The number of buffers created depends on how many rows fit into a buffer, which in turn depends on a few other factors:
1. The estimated row size.
2. The DefaultBufferMaxSize property of the data flow task. Its default value is 10 MB, and its upper and lower boundaries are MaxBufferSize (100 MB) and MinBufferSize (64 KB).
3. The DefaultBufferMaxRows property of the data flow task, which specifies the maximum number of rows in a buffer. Its default value is 10000.

Lookup transformation consideration
Choose the caching mode wisely after analyzing your environment. If you are using Partial Caching or No Caching mode, ensure that the reference table has an index for better performance. Instead of directly specifying a reference table in the lookup configuration, use a SELECT statement with only the required columns, and use a WHERE clause to filter out all rows that are not required for the lookup.
Set the data type of each column appropriately, especially if your source is a flat file. This lets you fit as many rows as possible into the buffer.
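To make the interaction of the three buffer factors concrete, the following T-SQL is a rough sketch of how many rows end up in one buffer. It assumes the engine simply caps the row count by both DefaultBufferMaxRows and DefaultBufferMaxSize, and it ignores the MinBufferSize adjustment. The row size of 2000 bytes is a hypothetical value chosen for illustration; it is not taken from any real package.

```sql
-- Rough estimate of rows per SSIS buffer (illustrative only).
-- @EstimatedRowSizeBytes is a hypothetical value; the real estimate is
-- derived by the engine from the column metadata of the data flow.
DECLARE @EstimatedRowSizeBytes int = 2000;
DECLARE @DefaultBufferMaxSize  int = 10485760;  -- 10 MB, the default
DECLARE @DefaultBufferMaxRows  int = 10000;     -- the default

-- Rows that fit within the size limit (integer division).
DECLARE @RowsBySize int = @DefaultBufferMaxSize / @EstimatedRowSizeBytes;

-- The smaller of the two limits determines the approximate rows per buffer.
SELECT
    @RowsBySize AS RowsAllowedBySize,
    CASE WHEN @RowsBySize < @DefaultBufferMaxRows
         THEN @RowsBySize
         ELSE @DefaultBufferMaxRows
    END AS ApproxRowsPerBuffer;
```

With these values the size limit allows only 5242 rows, so the 10000-row cap is never reached. Narrowing the row to 1000 bytes would allow 10485 rows by size, and the 10000-row cap would apply instead, which is why narrow data types and larger buffer settings both lead to fewer, fuller buffers.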
Avoid many small buffers. Tweak the values of DefaultBufferMaxRows and DefaultBufferMaxSize to get as many records into a buffer as possible, especially when dealing with large data volumes.

Full Load vs Delta Load
Design the package so that it does a full pull of data only at the beginning or on demand; from then on it should do an incremental pull. This greatly reduces the volume of data load operations, especially when volumes are likely to increase over the lifecycle of an application. For this purpose, use the CDC (Change Data Capture) feature of SQL Server 2008, enabled on the upstream source; for previous versions of SQL Server, implement incremental pull logic.

Use MERGE instead of SCD
The big advantage of the MERGE statement is that it handles multiple actions in a single pass over the data sets, rather than requiring multiple passes with separate inserts and updates. A well-tuned optimizer can handle this extremely efficiently.

Other recommendations
• Set the packet size of the connection to 32767.
• Keep data types as narrow as possible to reduce memory usage.
• Do not perform excessive casting.
• Use GROUP BY instead of aggregation.
• Weigh unnecessary delta detection against a simple reload.
• A commit size of 0 is fastest.

Benefits of using SSIS Partitioning
Following are some of the benefits of following SSIS Partitioning and Best Practices:
• It facilitates the management of large fact tables in data warehouses.
• Performance / parallelism benefits.
• Dividing the table across filegroups benefits I/O operations, fetching the latest data, re-indexing, and backup and restore.
• It supports range-based inserts and range-based deletes.
• It supports the sliding window scenario.
• In SQL Server 2008 SP2 and SQL Server 2008 R2 SP1, you can choose to enable support for 15,000 partitions.

Appendix
References used for Best Practices:
http://msdn.microsoft.com/en-us/library/ms190787.aspx
http://www.mssqltips.com/sql_server_business_intelligence_tips.asp