Implementing Change Data Capture for a Slowly Changing Dimension in SSIS 2005 Roderick Lee, 2010
Employee Rates Data Flow: The process must execute a Lookup on the target table for each incoming record to distinguish inserts from updates. Also, without separate tracking data, every run reads the whole source table, so the count of incoming records equals the size of the source table. (Figure: Sample Multi-Purpose Data Flow for both Inserts and Updates)
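The cost of the lookup-based approach can be illustrated with a minimal Python sketch. It is not the SSIS package itself: the function name, the `emp_id` key, and the `rate` column are hypothetical stand-ins for the employee rates data.

```python
def classify_by_lookup(source_rows, target_by_key):
    """Split source rows into inserts and updates by probing the target.

    source_rows: every row of the source table (hypothetical dicts).
    target_by_key: target rows indexed by emp_id (hypothetical key).
    """
    inserts, updates = [], []
    for row in source_rows:  # the full source table is read on every run
        existing = target_by_key.get(row["emp_id"])  # one lookup per row
        if existing is None:
            inserts.append(row)
        elif existing["rate"] != row["rate"]:
            updates.append(row)
    return inserts, updates


source = [
    {"emp_id": 1, "rate": 20.0},
    {"emp_id": 2, "rate": 31.0},
    {"emp_id": 3, "rate": 18.0},
]
target = {1: {"emp_id": 1, "rate": 20.0}, 2: {"emp_id": 2, "rate": 30.0}}
new_rows, changed_rows = classify_by_lookup(source, target)
```

The work grows with the size of the source table even when only a few rows changed, which is the inefficiency that change tracking is meant to remove.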
Change Data Capture (image from Microsoft Books Online, 2008): Change Data Capture (CDC) is an automated operation that records transactional activity (inserts, updates, and deletes) in the source table. This streamlines the ETL procedure because there is no need to compare all the data in the target table to identify changes. It also increases efficiency by limiting the source pool to changes that have already been identified. SQL Server 2008 has full CDC support and implements the capture process by reading the transaction log and writing the changes into a set of specialized CDC tables. This feature is new in SQL Server 2008 and does not exist in SQL Server 2005. Even without automated transaction log tracking, there are other ways to build a capture process. This demonstration uses triggers to load the changes into a CDC change table that is similar in design to the 2008 version.
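Trigger-based capture can be sketched in a few lines. The example below is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for SQL Server 2005, and the table, column, and trigger names are hypothetical rather than taken from the demonstration. The operation codes follow the SQL Server 2008 convention (1 = delete, 2 = insert, 3 = update before image, 4 = update after image).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_rate (emp_id INTEGER PRIMARY KEY, rate REAL);

-- Hypothetical change table modeled loosely on a SQL Server 2008 change table.
CREATE TABLE employee_rate_cdc (
    change_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    operation    INTEGER NOT NULL,   -- 1 delete, 2 insert, 3/4 update
    emp_id       INTEGER,
    rate         REAL,
    insert_date  TEXT DEFAULT CURRENT_TIMESTAMP,
    process_date TEXT                -- NULL until the ETL has consumed the row
);

CREATE TRIGGER trg_rate_insert AFTER INSERT ON employee_rate
BEGIN
    INSERT INTO employee_rate_cdc (operation, emp_id, rate)
    VALUES (2, NEW.emp_id, NEW.rate);
END;

-- An update writes two change rows: the old values, then the new values.
CREATE TRIGGER trg_rate_update AFTER UPDATE ON employee_rate
BEGIN
    INSERT INTO employee_rate_cdc (operation, emp_id, rate)
    VALUES (3, OLD.emp_id, OLD.rate);
    INSERT INTO employee_rate_cdc (operation, emp_id, rate)
    VALUES (4, NEW.emp_id, NEW.rate);
END;

CREATE TRIGGER trg_rate_delete AFTER DELETE ON employee_rate
BEGIN
    INSERT INTO employee_rate_cdc (operation, emp_id, rate)
    VALUES (1, OLD.emp_id, OLD.rate);
END;
""")

# Exercise the triggers with one insert, one update, and one delete.
conn.execute("INSERT INTO employee_rate VALUES (1, 20.0)")
conn.execute("UPDATE employee_rate SET rate = 22.5 WHERE emp_id = 1")
conn.execute("DELETE FROM employee_rate WHERE emp_id = 1")

captured = conn.execute(
    "SELECT operation, emp_id, rate FROM employee_rate_cdc ORDER BY change_id"
).fetchall()
```

In SQL Server the same idea would be written with T-SQL triggers that read the `inserted` and `deleted` pseudo-tables; the SQLite `NEW` and `OLD` references play that role here.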
Tables (figure captions: Original Target Table Adapted for SCD Type 2, CDC Table, Source Table). The first five CDC columns demonstrate the SQL Server 2008 change table architecture. "lsn" = log sequence number. The update mask column is a bit mask (stored as varbinary in SQL Server 2008) with one bit per original source column. Insert and process dates track CDC progress because there are no actual log sequence numbers.
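The update mask idea, one bit per source column, can be shown with a small Python sketch. The column list and the bit ordering (first column in the lowest bit) are assumptions made for illustration, not the layout used by the demonstration or by SQL Server.

```python
# Hypothetical source columns; bit 0 corresponds to the first column.
SOURCE_COLUMNS = ["emp_id", "rate", "effective_date"]


def build_update_mask(changed):
    """Return an integer with one bit set for each changed column."""
    mask = 0
    for position, name in enumerate(SOURCE_COLUMNS):
        if name in changed:
            mask |= 1 << position
    return mask


def changed_columns(mask):
    """Decode a mask back into the list of changed column names."""
    return [
        name
        for position, name in enumerate(SOURCE_COLUMNS)
        if mask & (1 << position)
    ]


rate_only = build_update_mask({"rate"})
decoded = changed_columns(0b110)
```

A downstream step can test a single bit to decide whether a change to a tracked attribute should create a new dimension version.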
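CDC Test Inserts and Updates: Result set in the CDC table tracking the changes. Note that each update creates two records; as in the SQL Server 2008 design, one holds the row's values before the change and the other holds the values after it. (Figure: Test script with inserts, updates, and deletes)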
SCD Data Flow: CDC for Slowly Changing Dimension. The SCD transform determines insert or update without the need for a separate Lookup transform. The Conditional Split is based on the CDC_$operation column. Note that the source table for this data flow is the CDC table.
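The routing logic can be approximated in plain Python. The slide does not show how the Conditional Split and the SCD transform are wired together, so the sketch below is one plausible reading: rows are routed by operation code, inserts add a new dimension row, and the after image of an update expires the current version and adds a new one (Type 2). The operation codes follow the SQL Server 2008 convention; the function name and the `emp_id`, `rate`, `start_date`, and `end_date` fields are hypothetical.

```python
from datetime import date

# Operation codes as used by SQL Server 2008 change tables.
DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4


def apply_changes_type2(dimension, change_rows, as_of):
    """Apply CDC rows, in capture order, to a Type 2 dimension (list of dicts)."""
    for row in change_rows:
        op = row["operation"]
        if op == UPDATE_AFTER:
            # Expire the current version of this member.
            for version in dimension:
                if version["emp_id"] == row["emp_id"] and version["end_date"] is None:
                    version["end_date"] = as_of
        if op in (INSERT, UPDATE_AFTER):
            # Both inserts and update after images add a new current version.
            dimension.append(
                {
                    "emp_id": row["emp_id"],
                    "rate": row["rate"],
                    "start_date": as_of,
                    "end_date": None,
                }
            )
        # Before images and deletes are ignored in this sketch.
    return dimension


dimension = []
changes = [
    {"operation": INSERT, "emp_id": 1, "rate": 20.0},
    {"operation": UPDATE_BEFORE, "emp_id": 1, "rate": 20.0},
    {"operation": UPDATE_AFTER, "emp_id": 1, "rate": 22.5},
]
apply_changes_type2(dimension, changes, date(2010, 1, 1))
```

Because the operation code already says whether a row is new or changed, no probe of the target is needed to classify it; the target is touched only to expire the current version.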
Near Real-Time Changes: Reduce Source-Target Latency. Running the SSIS package as a recurring background job can reduce the latency interval to the execution time of the complete CDC process. This demonstration has a single data flow, so a For Loop container can serve a similar purpose: the data flow executes multiple times within the loop and captures any new changes in the CDC table.
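The looping pattern can be sketched as follows. This is a simplified illustration, not the For Loop container's configuration: it assumes the process date column is empty until a change row has been consumed, and the function names and the `handle` callback are hypothetical.

```python
from datetime import datetime


def process_pending(cdc_rows, handle, now):
    """Handle every change row not yet processed and stamp its process date."""
    pending = [row for row in cdc_rows if row["process_date"] is None]
    for row in pending:
        handle(row)
        row["process_date"] = now
    return len(pending)


def run_loop(cdc_rows, handle, iterations, clock=datetime.now):
    """Run the data flow a fixed number of times, like a bounded For Loop."""
    counts = []
    for _ in range(iterations):
        counts.append(process_pending(cdc_rows, handle, clock()))
    return counts


cdc_rows = [
    {"emp_id": 1, "rate": 20.0, "process_date": None},
    {"emp_id": 2, "rate": 31.0, "process_date": None},
]
handled = []
counts = run_loop(cdc_rows, handled.append, iterations=2)
```

Each pass picks up only rows that arrived since the previous pass, so the delay between a source change and its arrival in the target is bounded by the duration of one pass.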
Final Results: A second set of inserts and updates, and the corresponding changes to the CDC and target tables, mere seconds later.