This document provides an overview of SSIS design patterns for data warehousing and change data capture. It discusses what design patterns are and how they are commonly used in SSIS and data warehousing projects. It then covers 13 specific patterns, including truncate and load, slowly changing dimensions, hashbytes, change data capture, merge, and master/child workflows. The document explains when each pattern is best used and provides pros and cons. It also provides guidance on configuring and using SQL Server change data capture (CDC) functionality.
3. What is a Design Pattern?
• Pattern – A design for a package that solves a certain scenario
• Over time certain SSIS logic flows have emerged as best practices
• These designs have been classified into patterns for reference purposes
• Standard Design Patterns
– Learn from others
– Common patterns make it easy for new personnel to understand and work with
– Easy to apply in new projects
4. Design Patterns and Data Warehousing
• SSIS most commonly used in Data Warehousing
• Patterns in this course most commonly used in Data Warehousing
• Applicable to non-DW projects
• Definitions
– Type 1 – Dimension updates simply overwrite pre-existing values
– Type 2 – Each update to a dimension causes a new record to be created
– Fact – Records the measures for a transaction and associates with dimensions
5. What You Need?
• SQL Server Data Tools – BI Components
– For SQL Server 2012 use Visual Studio 2012
– For SQL Server 2014 use Visual Studio 2013
• SQL Server Data Tools – Database Project – SQL Server 2012
– Uses Visual Studio 2012
• SQL Server Data Tools – Database Project – SQL Server 2014
– Included in Visual Studio 2013 Community Edition
– Included in other versions of VS 2013 out of the box
– Make sure to install Update 4
6. Versions of SQL Server
• We will use SQL Server 2014 Project Deployment Mode
• Material works identically in 2012 (Project Deployment Mode)
– Package Deployment Mode for 2012/2014 requires older style configurations for Master/Child
• Patterns applicable in 2008R2 & 2008 with limitations
– CDC has to be manually implemented, no controls in SSIS Toolbox
• Master / Child works differently – uses configurations
• Limited applicability to SQL Server 2005
– No Hashbytes
– No Merge
– No CDC
– Master / Child works differently – uses configurations
7. Deploying the Test Database
• Before running the project you will need to deploy and set up the test database
• Uses an SSDT Database Project as part of the solution
• Deploy the database
• After the deploy, run the stored procedure DDL.CreateAllObjects
8. The 13 Patterns
• Truncate and Load
• SCD Wizard
– Type 1
– Type 2
• Set Based Updates
– Type 1
– Type 2
• Hashbytes
– Different Databases
– Same Database
• Change Data Capture
• Merge
• Date Based
• Fact Table Pattern
• Master / Child
– Basic
– Passing Parameters
– Load Balancing
9. Truncate and Load
• Deletes all rows in target, then completely reloads from source
• Commonly used in staging environments, often with other patterns
• Pros
– Simple to implement
– Fast for small to medium datasets
• Cons
– No change tracking
– Slower for large datasets
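A minimal T-SQL sketch of the pattern, assuming a hypothetical staging target dbo.StgCustomer loaded from a hypothetical source table src.Customer (in SSIS the TRUNCATE typically runs in an Execute SQL Task and the reload in a data flow):

-- Truncate and Load: empty the target, then reload everything from the source.
-- dbo.StgCustomer and src.Customer are illustration-only names.
TRUNCATE TABLE dbo.StgCustomer;

INSERT INTO dbo.StgCustomer (CustomerID, FirstName, LastName, Email)
SELECT CustomerID, FirstName, LastName, Email
FROM   src.Customer;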
10. SCD Wizard Type 1
• SCD (Slowly Changing Dimension) Wizard
• Pattern for tables with Type 1 attributes only
• Pros
– Easy to create
– Good for very very small updates
• Cons
– When something changes all SCD generated components must be deleted and recreated.
– Incredibly slow.
– Did we mention it is slow?
– It is really really slow.
11. SCD Wizard Type 2
• SCD (Slowly Changing Dimension) Wizard
• Pattern for tables with Type 1 and 2 attributes
• Wizard is the same for both patterns, just different options
• Pros
– Easy to create
– Good for very small updates
• Cons
– When something changes all SCD generated components must be deleted and recreated.
– Incredibly slow.
– It didn’t get any faster since the last section
12. Set Based Updates – Type 1
• Set Based Updates – Type 1
• Pros
– Scales well
– Runs fast
• Cons
– Requires extra tables in the database
– Requires more setup work in the package
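A hedged sketch of the set-based Type 1 update that would typically run in an Execute SQL Task after the data flow has landed changed rows in an extra update-staging table; all table and column names here are hypothetical:

-- Type 1: overwrite current dimension values from the update-staging table.
UPDATE d
SET    d.FirstName = u.FirstName,
       d.LastName  = u.LastName,
       d.Email     = u.Email
FROM   dbo.DimCustomer AS d
       JOIN upd.DimCustomer AS u
         ON u.CustomerAK = d.CustomerAK;   -- business (alternate) key

TRUNCATE TABLE upd.DimCustomer;            -- clear the staging table for the next run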
13. Set Based Updates – Type 2
• Set Based Updates – Type 2
• Pros
– Scales well
– Runs fast
• Cons
– Logic is somewhat complex
– Requires extra tables in the database
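A sketch of the corresponding set-based Type 2 logic, again assuming hypothetical dimension and update-staging tables with RowIsCurrent / RowStartDate / RowEndDate housekeeping columns:

-- Type 2: expire the current row, then insert the new version of the row.
UPDATE d
SET    d.RowIsCurrent = 0,
       d.RowEndDate   = SYSDATETIME()
FROM   dbo.DimCustomer AS d
       JOIN upd.DimCustomer AS u
         ON u.CustomerAK = d.CustomerAK
WHERE  d.RowIsCurrent = 1;

INSERT INTO dbo.DimCustomer
       (CustomerAK, FirstName, LastName, Email, RowIsCurrent, RowStartDate, RowEndDate)
SELECT CustomerAK, FirstName, LastName, Email, 1, SYSDATETIME(), '9999-12-31'
FROM   upd.DimCustomer;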
14. Hashbytes – Different Databases
• Uses the Hashbytes function to generate a hash value for comparisons
• Pros
– Good for tables with many columns
– Scales well – fast
• Cons
– Requires use of lookups – caching requires memory
– Requires concatenation of all data columns in the select statement
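A sketch of the hash computation on the source side, with hypothetical table and column names; the resulting RowHash is compared against the stored hash (for example in a Lookup followed by a Conditional Split) to detect changed rows:

-- Concatenate the data columns with a delimiter so ('ab','c') and ('a','bc')
-- do not produce the same hash, then hash the result.
-- Note: before SQL Server 2016, HASHBYTES input is limited to 8,000 bytes.
SELECT CustomerID,
       HASHBYTES('SHA2_256',
                 CONCAT(FirstName, '|', LastName, '|', Email, '|', City)) AS RowHash
FROM   src.Customer;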
15. Hashbytes – Same Database
• Uses Hashbytes with a Merge Join
• Pros
– Avoids use of lookups, lowers memory requirements
– Scales very well
– Will work on different databases but most efficient in a single database
• Cons
– Requires data sources to be sorted
– Requires a common key to sort on
– Needs to concatenate data columns for Hashbytes
16. Change Data Capture
• Lets SQL Server track which rows in the source have changed
• Pros
– Tracks changes to data
– Only reads rows which have changed
– Easy to determine Create / Update / Delete actions
• Cons
– Only works with SQL Server
– Requires setup work in the database and tables before it can be used
– Must have the ability to alter the source system
17. Merge
• Uses the SQL Server MERGE statement
• Pros
– Simple to implement
– Very fast
• Cons
– No transformations
– No ability to track progress
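A hedged T-SQL sketch of the pattern using hypothetical staging and dimension tables; the MERGE usually runs in an Execute SQL Task after staging is loaded:

-- One statement handles both inserts of new rows and updates of changed rows.
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
   ON tgt.CustomerAK = src.CustomerAK
WHEN MATCHED AND (tgt.FirstName <> src.FirstName
               OR tgt.LastName  <> src.LastName
               OR tgt.Email     <> src.Email) THEN
    UPDATE SET tgt.FirstName = src.FirstName,
               tgt.LastName  = src.LastName,
               tgt.Email     = src.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerAK, FirstName, LastName, Email)
    VALUES (src.CustomerAK, src.FirstName, src.LastName, src.Email);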
18. Date Based
• Uses date driven values to determine changes
• Pros
– Easy to determine changes to rows
– Reduces number of rows that are read
– Can be combined with any of the other patterns
• Cons
– Requires source system to have a reliable date field indicating changes
– Still requires logic to determine new rows vs updates
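A minimal sketch of a date-based extract, assuming the source has a reliable ModifiedDate column and the last successful load time is kept in a hypothetical etl.LoadControl table:

-- Read only rows modified since the last successful load.
DECLARE @LastLoadDate datetime2 =
        (SELECT MAX(LoadEndDate) FROM etl.LoadControl);   -- hypothetical control table

SELECT CustomerID, FirstName, LastName, Email, ModifiedDate
FROM   src.Customer
WHERE  ModifiedDate > @LastLoadDate;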
19. Fact Table Pattern
• Used to update metrics / measurements in the data warehouse
• Pros
– Common pattern
– Easy to implement
• Cons
– Can require many lookups
– Updates not always simple
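A sketch of a fact load with hypothetical table names; in SSIS the dimension-key joins below are usually implemented as Lookup transformations:

-- Resolve surrogate keys against the dimensions, then insert the measures.
INSERT INTO dbo.FactSales (DateKey, CustomerKey, ProductKey, Quantity, SalesAmount)
SELECT d.DateKey, c.CustomerKey, p.ProductKey, s.Quantity, s.SalesAmount
FROM   stg.Sales AS s
       JOIN dbo.DimDate     AS d ON d.FullDate   = s.OrderDate
       JOIN dbo.DimCustomer AS c ON c.CustomerAK = s.CustomerID
       JOIN dbo.DimProduct  AS p ON p.ProductAK  = s.ProductID;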
20. Master / Child (Basic)
• A master (parent) package which coordinates the execution of other packages (children)
• Pros
– Simple to implement
• Cons
– Not always efficient when many packages are involved
21. Master / Child (Parameters)
• Passing values from the master package to children
• Pros
– SQL Server 2012 / 2014 project deployment mode makes it very easy to pass values
– Easy to reuse values across multiple child packages
• Cons
– In package deployment mode, or SQL Server 2008R2 and previous, requires the more complex configurations
22. Master/Child (Load Balanced)
• Uses a table to drive package execution
• Pros
– Easy to alter execution – just update a table
– Can easily balance parallel execution of packages
• Cons
– Needs many variables
– Requires some manual effort and monitoring to effectively balance
23. Choosing a Pattern
• Truncate and Load
– Low to moderate number of rows
– No requirement to track changes
– Good for staging tables
• SCD Wizard, Type 1 & 2
– Very small number of rows (< 2000)
– Packages that won’t change
– There is almost always a better pattern
24. Choosing a Pattern
• Set Based Updates, Type 1 & 2
– Scales well
– Good for limited number of columns
– Extra RAM required
• Hashbytes
– Scales well
– Good for large number of columns
– Source system needs to implement a form of the Hashbytes function
• Change Data Capture
– Excellent pattern – SQL Server tells you all changes
– Data source must be SQL Server
25. Choosing a Pattern
• Merge
– Good for very simple ETL when no monitoring is required
• Date Based
– Limits number of rows read in
– Can be combined with other patterns
• Fact Table Pattern
• Master / Child
– Basic
– Passing Parameters
– Load Balancing
27. Introduction
• Many applications have requirements for identifying data changes that have occurred in a database for various reasons
– Tracking historical changes to data
– Auditing changes to data
– Synchronizing data changes across disconnected systems
– Implementing an Operational Data Store (ODS)
– Incremental data loading for a Data Warehouse
• SQL Server provides many techniques for tracking changes to data
28. Techniques for Identifying Changes
• DML table triggers
– Can track before and after row state and deleted rows
– Can be customized to include the user, modification time, or input buffer
– Can introduce significant performance overhead on transactional systems
• Modified datetime or timestamp column
– Can introduce performance overhead to pull changes
– Does not track deleted rows
– Requires schema modifications to source tables and code to set the value
29. Techniques for Identifying Changes
• Data comparisons
– Comparing source and destination data requires scanning all rows to determine changes and introduces significant performance overhead
• Replication and Subscriber triggers
– Offloads change identification to the subscription database
– Requires customizations and manual management of schema changes
30. Change Data Capture Solution
• Change Data Capture provides information about DML changes to a table in near real-time using the same Log Reader Agent as transactional replication
– Eliminates expensive techniques that require schema modifications
• DML triggers
• Timestamp columns
• Data comparisons and complex JOIN queries
• May be used to answer a variety of critical questions:
– What are all of the changes that happened to a table since the last ETL?
– Which columns changed?
– What type of changes occurred? INSERT/UPDATE/DELETE?
– What was the before image of a row that was modified?
• Supports net change identification, with a performance trade-off due to an additional index on change tables
31. Configuring Change Data Capture
• Change Data Capture provides the ability to capture the row data from DML changes to a database when enabled for capture
• Configuring Change Data Capture has specific requirements, which when met allow individual tables to be configured for change capture
• The options for configuring a table for change capture affect performance, the data collected, and security controls for accessing the capture tables
32. Requirements to Enable CDC
• Enterprise Edition feature only
• Enabling CDC for a database requires sysadmin privileges
– Requires executing sp_cdc_enable_db
• Enabling CDC for a table requires db_owner privileges in the database
– Requires executing sp_cdc_enable_table for each table that will track changes within the database
• Querying the results from the CDC tables requires membership within the database role specified in the sp_cdc_enable_table procedure call, if specified
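A minimal enablement script; MyDatabase and dbo.Customer are hypothetical, and the table is assumed to have a primary key so net changes can be supported:

USE MyDatabase;
EXEC sys.sp_cdc_enable_db;                 -- database level, requires sysadmin

EXEC sys.sp_cdc_enable_table               -- table level, requires db_owner
     @source_schema        = N'dbo',
     @source_name          = N'Customer',
     @role_name            = NULL,         -- NULL = no gating role
     @supports_net_changes = 1;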
33. sp_cdc_enable_table Options
• @source_schema – the name of the source table schema (required)
• @source_name – the name of the source table (required)
• @role_name – the name of the database security role used for gating access to the change data (required, but can explicitly be set to NULL)
• @capture_instance – the name of the capture instance (optional)
• @supports_net_changes – indicates whether querying for net changes is supported by the capture instance (optional, defaults to 1)
– Enabling net change support adds an additional non-clustered index to the capture table, which can impact insert performance for change rows
• @index_name – the name of a unique index to uniquely identify rows in the source table (optional, defaults to the primary key)
34. sp_cdc_enable_table Options
• @captured_column_list – the list of source table columns to include in the change table (optional)
• @filegroup_name – the filegroup to be used for the change table (optional)
• @allow_partition_switch – indicates whether the SWITCH PARTITION command of ALTER TABLE can be executed against a table that is enabled for CDC (optional)
– Switching a partition into a CDC enabled table does not generate INSERT change data for rows that previously existed in the partition prior to the switch
– Switching a partition out of a CDC enabled table does not generate DELETE change data for the rows contained within the partition
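For illustration, a hypothetical call that sets the optional parameters described on the last two slides; every name and value here is an assumption:

EXEC sys.sp_cdc_enable_table
     @source_schema          = N'dbo',
     @source_name            = N'Customer',
     @role_name              = N'cdc_reader',        -- gating role for change data access
     @capture_instance       = N'dbo_Customer_v2',
     @supports_net_changes   = 1,
     @captured_column_list   = N'CustomerID, FirstName, LastName, Email',
     @filegroup_name         = N'CDC_FG',
     @allow_partition_switch = 1;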
36. Introduction
• After enabling a database for Change Data Capture and configuring capture instances for the source tables, the change data must be queried for processing
• All change rows are identified by the Log Sequence Number (LSN) associated with the transaction that changed the row
• Change tables include internal metadata columns that describe the change row as well as the captured columns configured for the table
37. Finding Change Table Metadata
• Stored Procedures
– sys.sp_cdc_help_change_data_capture – returns the CDC capture information for each table enabled within a database
• May return up to two rows per table, one for each capture instance
• The @source_schema parameter specifies the source schema to return results for when the procedure executes
• The @source_name parameter specifies the source table to return results for when the procedure executes
– sys.sp_cdc_get_captured_columns – returns the captured columns for the capture instance specified by the @capture_instance parameter
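Example calls, assuming the hypothetical dbo.Customer table and its default capture instance name dbo_Customer:

EXEC sys.sp_cdc_help_change_data_capture
     @source_schema = N'dbo',
     @source_name   = N'Customer';

EXEC sys.sp_cdc_get_captured_columns
     @capture_instance = N'dbo_Customer';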
38. Finding Change Table Metadata
• System tables
– cdc.change_tables – contains up to two rows, one per capture instance enabled on a source table
– cdc.captured_columns – contains one row per captured column for a source capture instance
• Querying the system tables directly is not recommended; use the stored procedures instead
39. Understanding Change Table Columns
• Change tables exist within the CDC schema and are named <capture_instance>_CT
• The first five columns are metadata columns:
– __$start_lsn – the starting LSN of the transaction
– __$end_lsn – the ending LSN of the transaction
– __$seqval – the sequence or order of the row changes within a transaction
– __$operation – the type of operation reflected by the change row
• 1 = Delete
• 2 = Insert
• 3 = Value before update
• 4 = Value after update
– __$update_mask – bitmask of columns changed by the operation within the row
• Remaining columns match the source table column definition when the capture instance was created
40. Determining Change Rows to Process
• By LSN:
– sys.fn_cdc_map_time_to_lsn ( '<relational_operator>', tracking_time )
• Returns the LSN value from the start_lsn column of the cdc.lsn_time_mapping system table for the tracking_time specified
• The relational_operator specifies the comparison to be applied against the tran_end_time of the cdc.lsn_time_mapping table when determining the LSN value to return
– largest less than, largest less than or equal, smallest greater than, or smallest greater than or equal
– sys.fn_cdc_get_min_lsn ( 'capture_instance_name' )
• Returns the start_lsn value for the capture instance from cdc.change_tables
• Sets the lower endpoint for change data for a given capture instance
– sys.fn_cdc_get_max_lsn ()
• Returns the maximum start_lsn column value from the cdc.lsn_time_mapping system table, setting the upper endpoint for all capture instances
41. Determining Change Rows to Process
• By LSN:
– Custom tracking table updated by application code to track the capture instance name and last processed LSN
– sys.fn_cdc_decrement_lsn ( lsn_value )
• Returns the previous LSN in the sequence based upon the specified LSN
• Often used to decrement the sys.fn_cdc_get_max_lsn () value to set the upper endpoint without overlapping LSNs across different data loads
– sys.fn_cdc_increment_lsn ( lsn_value )
• Returns the next LSN in the sequence based upon the specified LSN
• Often used to increment the last saved LSN from a custom tracking table to set a new lower endpoint without overlapping LSNs across different data loads
• By time:
– sys.fn_cdc_map_lsn_to_time ( lsn_value )
• Returns the tran_end_time column from cdc.lsn_time_mapping for the specified LSN, allowing an LSN endpoint to be expressed as a transaction commit time
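A sketch of establishing an LSN range for one incremental load; etl.CdcTracking is a hypothetical custom tracking table and dbo_Customer a hypothetical capture instance:

DECLARE @from_lsn binary(10), @to_lsn binary(10), @last_lsn binary(10);

SELECT @last_lsn = LastProcessedLsn
FROM   etl.CdcTracking                     -- hypothetical tracking table
WHERE  CaptureInstance = N'dbo_Customer';

-- Start just past the last processed LSN, or at the capture instance minimum on the first run.
SET @from_lsn = ISNULL(sys.fn_cdc_increment_lsn(@last_lsn),
                       sys.fn_cdc_get_min_lsn(N'dbo_Customer'));
SET @to_lsn   = sys.fn_cdc_get_max_lsn();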
42. Change Row Table-Valued Functions
• cdc.fn_cdc_get_all_changes_<capture_instance>
– Returns one row for each modification applied to the source table within the specified LSN range
– Multiple modifications of a source row within the LSN range will be represented individually in the result set
• cdc.fn_cdc_get_net_changes_<capture_instance>
– Returns a single net change row for each source row modified within the specified LSN range
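Continuing the LSN-range sketch above, the change rows are read from the capture-instance-specific functions (the dbo_Customer names are hypothetical):

SELECT __$start_lsn,
       __$operation,      -- 1 = delete, 2 = insert, 3 = before update, 4 = after update
       __$update_mask,
       CustomerID, FirstName, LastName, Email
FROM   cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');

-- One net row per modified source row (requires @supports_net_changes = 1):
SELECT *
FROM   cdc.fn_cdc_get_net_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');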
43. Determining Whether a Column
Changed
• sys.fn_cdc_get_column_ordinal (
'capture_instance', 'column_name‘ )
– Return s the ordinal position of a column name
within the specified capture instances update mask
• sys.fn_cdc_is_bit_set ( position, update_mask )
– Checks the specified ordinal position of the update
mask to determine if the change bit is set
• sys.fn_cdc_has_column_changed (
'capture_instance', 'column_name', update_mask )
– Identifies whether the specified column has been
updated in the associated change row
– Ideally only used for post processing
– Use sys.fn_cdc_get_column_ordinal once to set the
position, and sys.fn_cdc_is_bit_set to parse the
update_mask in queries against change tables for
better performance
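A sketch of that pattern (the capture instance and column names are illustrative):
-- resolve the ordinal once, outside the per-row work
DECLARE @name_pos int =
    sys.fn_cdc_get_column_ordinal(N'Production_Product', N'Name');

-- then test the update mask for each change row
SELECT __$start_lsn, __$operation,
       sys.fn_cdc_is_bit_set(@name_pos, __$update_mask) AS NameChanged
FROM   cdc.fn_cdc_get_all_changes_Production_Product(
           sys.fn_cdc_get_min_lsn(N'Production_Product'),
           sys.fn_cdc_get_max_lsn(), N'all');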
45. Introduction
• SQL Server 2012 introduced new
SQL Server Integration Services
(SSIS) components for CDC to simplify
extracting and consuming change
data
• Using the CDC components does
not require advanced knowledge
of SSIS to move change data from
a source system to a target for
further processing
46. CDC Control Task Component
• Used to control the life cycle of CDC packages in SSIS
– Synchronizes initial package load and the management of
LSN ranges processed by the CDC package executions
– Maintains the state across executions by persisting state
variable to a table
– Handles error scenarios and recovery from problems
during processing
• Supports two types of operations
– Synchronization of the initial data load and change processing
• Mark initial load start and initial load end for a full load from an
active source
• Resetting the CDC state variable to restart tracking
• Marking the CDC start from a snapshot LSN from a snapshot
database
– Management of change processing LSN ranges and tracking
what is processed successfully
• Getting a processing range before execution
• Marking a processing range after successfully processing changes
47. CDC Control Task Component
• Persisting state across executions
– Manual state persistence requires the package developer
to read and write the state variable for the package
– Automated state persistence reads the value of the state
variable from the table configured in the Control Task
editor to get the processing range and writes the value to
the table to mark the processed range
• Errors can be reported by the Control Task if:
– A get processing range is called after a previous get
processing range operation without the mark processed
range operation occurring
• Possibly a different package running concurrently with the same
state variable name
– Reading the persisted state variable value from the
persisted store fails
– The state variable value read from the persistent store is
not consistent
– Writing the state variable value to the persistent store fails
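For automated state persistence, the CDC Control Task reads and writes a small state table. A minimal sketch of one possible shape (the table name, schema, and column sizes are assumptions; the task's editor can also create the table for you):
-- Hypothetical state table for automated CDC state persistence:
-- one row per state variable name
CREATE TABLE dbo.cdc_states
(
    name  nvarchar(256) NOT NULL PRIMARY KEY,
    state nvarchar(256) NOT NULL
);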
48. CDC Source Component
• Reads the processing range of change data from a capture instance change table and
delivers the changes to other SSIS components
– The processing range is derived from the state package
variable that is set by a CDC Control task executed before
the data flow starts
• The CDC source requires the following configurations:
– ADO.NET connection manager to access the SQL Server
CDC database
– The name of a table enabled for CDC
– The name of the capture instance of the table to read the
changes from
– The change processing mode to use for reading the
changes
– The name of the CDC state package variable used to determine the CDC processing range
• The CDC source does not modify that variable; a subsequent CDC
Control task execution after the data flow must be used to update
the state values
49. CDC Processing Modes (1)
• All
– Returns a single row for each change applied to the source
table
– Similar to querying the
cdc.fn_cdc_get_all_changes_<capture_instance> table-
valued function with the ‘all’ filter option
• All with old values
– Similar to All, but with two rows per update, one for the
Before value and one for the After value
– Similar to querying the
cdc.fn_cdc_get_all_changes_<capture_instance> table-
valued function with the ‘all update old’ filter option
– The __$operation column distinguishes between Before (3)
and After (4)
50. CDC Processing Modes (2)
• Net
– Rolls up all changes for a key into a single row to simplify ETL processing
– Requires @supports_net_changes = 1 for the capture instance
– Similar to querying the cdc.fn_cdc_get_net_changes_<capture_instance>
table-valued function with the ‘all’ filter option
• Net with update mask
– Similar to Net but includes additional boolean columns
(__$<column_name>_Changed) specifying whether a column was changed
– Similar to querying the cdc.fn_cdc_get_net_changes_<capture_instance>
table-valued function with the ‘all with mask’ filter option
51. CDC Processing Modes (3)
• Net with merge
– Groups INSERT and UPDATE operations
together making it easier to use the
MERGE statement (__$operation = 5)
– Similar to querying the
cdc.fn_cdc_get_net_changes_<capture_i
nstance> table-valued function with the
‘all with merge’ filter option
– Only the DELETE and UPDATE split paths
will receive rows from the CDC Splitter in
this mode
52. CDC Splitter Components
• Splits a single input of change rows from the CDC
Source component into separate outputs for Insert,
Update and Delete operations based on the
__$operation column value from the change table
– 1 – Delete
– 2 – Insert (not available using Net with Merge mode)
– 3 – Before Update row (only when using All with Old Values
mode)
– 4 – After Update row
– 5 – Merged Update row (only when using Net with Merge
mode)
• The CDC Source for the Data Flow must have the
Net CDC processing mode configured to use the CDC
Splitter
• No advanced configuration is required for the CDC
Splitter
53. Package Design Considerations
Configure separate packages for handling Initial Load and Incremental Loads
– The initial load will mark the start LSN before transferring data
from the source, and the end LSN after, using the CDC tracking
variable for all tables associated with the data flow
– Facilitates easier re-initialization from the source system if necessary
• Error handling considerations need to be made when operation order must be
maintained as a part of the data flow
– CDC components can redirect error rows when appropriate
to prevent component failures but may result in out-of-
order processing of changes
• Consider using staging tables to fast load change data
and perform batch processing of changes in Transact-
SQL to prevent row-by-row processing of changes in
SSIS (see the sketch after this list)
– Change from ETL (SSIS processing of rows) to ELT (database engine processing) to benefit from set-based operations
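A minimal sketch of that ELT step (the target table dbo.DimProduct is a hypothetical name, the staging table names follow the later slides, and the column list is abbreviated):
-- Set-based processing of staged change rows instead of row-by-row SSIS logic
MERGE dbo.DimProduct AS tgt
USING (
    SELECT ProductID, Name FROM stage.stageProduct_Inserts
    UNION ALL
    SELECT ProductID, Name FROM stage.stageProduct_Updates
) AS src
    ON tgt.ProductID = src.ProductID
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, Name) VALUES (src.ProductID, src.Name);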
54. CDC Setup
• Step 1. Enable CDC for the database

USE AdventureWorks2012
GO
EXEC sp_changedbowner 'sa'
GO
EXEC sys.sp_cdc_enable_db
GO

– Enabling CDC for the database creates the CDC functions,
CDC stored procedures, and CDC tables
55. CDC Setup
• Step 2. Enable CDC for table(s)

USE AdventureWorks2012
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'Production'
    ,@source_name = N'Product'
    ,@role_name = N'cdc_Admin'
    ,@capture_instance = N'Production_Product'
    ,@supports_net_changes = 1

– Creates the change table cdc.Production_Product_CT
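Optionally, the resulting configuration can be reviewed with the built-in help procedure (a quick sanity check, not a required step):
-- Returns the CDC configuration for the capture instance(s) on the table
EXEC sys.sp_cdc_help_change_data_capture
     @source_schema = N'Production',
     @source_name   = N'Product';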
56. Anatomy of a CDC Table
• __$start_lsn and __$seqval
– Link record to a transaction
– Specify order of operations
• __$operation
– 1 = delete
– 2 = insert
– 3 = update (record data before change)
– 4 = update (record data after change)
– 5 = merge
• __$update_mask
– Identify which columns changed
– Use with sys.fn_cdc_has_column_changed
57. CDC in Integration Services
• Diagram: source database and staging database table structures
– SSIS manages the current state of CDC processing in a state table
– One staging table per type of change, with the source columns AND
the change columns
– Exception: the Updates table also includes a ChangeType column when
both Type 1 and Type 2 processing is required
58. CDC in Integration Services Control Flow – Extraction
• Separate Initial Load and Incremental Load control flows
– A CDC Control Task marks the beginning and end of processing
– The three staging tables are truncated before loading
61. Data Flow – Transform and Load

SELECT [__$start_lsn]
      ,[__$operation]
      ,[__$update_mask]
      ,[ProductID]
      ,[Name]
      . . .
FROM [stage].[stageProduct_Inserts]
UNION ALL
SELECT [__$start_lsn]
      ,[__$operation]
      ,[__$update_mask]
      ,[ProductID]
      ,[Name]
      . . .
FROM [stage].[stageProduct_Updates]
WHERE ChangeType = 2