A data driven etl test framework sqlsat madison


Published on

1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

A data driven etl test framework sqlsat madison

  1. 1. SQL Saturday Madison – March 29th, 2014
  2. 2. Who Am I? • Terry Bunio • Data Base Administrator - Oracle, SQL Server 6,6.5,7,2000,2005,2012, Informix, ADABAS • Sharepoint fan • Data Modeler/Architect - Investors Group, LPL Financial, Manitoba Blue Cross, Assante Financial, CI Funds, Mackenzie Financial - Normalized and Dimensional • Agilist - Innovation Gamer, Team Member, SQL Developer, Test writer, Sticky Sticker, Project Manager, PMO on SAP Implementation - Currently all of these all the same time on an Agile SAP implementation for an Insurance company
  3. 3. Protegra • In Business 13+ years • Offices in Winnipeg and San Diego • Agile and Lean key to our practices • 70+ consultants • Software Development and Business Performance Consulting • Microsoft Gold Partner • Culture and Community are cornerstones of the company
  4. 4. Our Mission
  5. 5. Learning Objectives • Why do we automate tests? • Why is Automated Database Testing Difficult? • What was the problem? • What is tSQLt and SQLtest? • How we create a Data Driven ETL Test Framework?
  6. 6. Agile
  7. 7. Agile • Agile software development is a group of software development methods based on iterative and incremental development, where requirements and solutions evolve through collaboration between self- organizing, cross-functional teams. It promotes adaptive planning, evolutionary development and delivery, a time- boxed iterative approach, and encourages rapid and flexible response to change. It is a conceptual framework that promotes foreseen interactions throughout the development cycle.
  8. 8. Waterfall
  9. 9. Agile
  10. 10. Waterfall versus Agile
  11. 11. Waterfall • Do as much as you can before you fall off that cliff because it is difficult to go back up - Thorough Business Case - Detailed Requirements - Detailed Documentation - Extensive Construction Phase - Extensive Test Phase - Critical Deployment Phase
  12. 12. Agile • Do as little as you can before you can proceed because we want to deliver asap and are going to have another cycle anyway… - Lean Start-up Business Case - Minimal Requirements (some in the form of Tests) - Minimal Design - Minimal Construction - Minimal Test - Minimal Deployment
  13. 13. Automation • We usually try to automate tasks that: - Take a lot of time - That we need to do frequently - That are error prone - That the required people expertise can be captured and duplicated
  14. 14. What is more awesome than an robot Ninja?
  15. 15. Agile Automation • In Agile, the tasks that lend themselves to automation are: - Testing – Automated Testing - Deployment – Continuous Integration • This is primarily due to: - The amount of times they need to be run - The potential for people to introduce errors - The ability to test and catch errors early - The ease of automation, the ease of transferring human logic into software
  16. 16. Agile
  17. 17. Dangers of Automated Testing
  18. 18. Dangers of automated testing • Automated testing can provide great benefit, but you need to ensure that it is flexible, nimble, and easy to change and adapt • Consider how you can create your tests to make them as easy as possible to maintain and adapt to change • If you design for this, the chances are your automated tests will be more valuable and evolve with the software - Otherwise some may deem them ‘too expensive’ to modify and enhance and they will be left behind
  19. 19. Automated Database Testing
  20. 20. Automated Database Testing • Unit Testing databases is hard - Requires modification of the data in the database - Complex Querying of Multiple Tables - Multiple types of objects • Views • Functions • Stored Procedures • Triggers - Many types of Database Technologies - Need to test structure(schema), state(data), and process(Stored Procedures and ETL processes)
  21. 21. Automated Database Testing • Lots of existing frameworks out there for automated testing in programming languages - Java – Junit - .NET – Nunit - Fitnesse - Etc… • Although you could use these frameworks to perform automated database testing, it certainly isn’t easy to adapt - Refer to Lion on ice-floe cartoon
  22. 22. What we need.. • Is an Automated Database testing framework that - Doesn’t require another programming language to run - Handles the complexity of adding and deleting data for tests - Provides additional functionality that makes database testing easy • For example, a table to table comparison test or ability to create fake data and tables - Makes it easy to manage the large amount of tests that can be created in a database - Allows us to create database tests on: • Structure • Data • Metadata
  23. 23. The Project • Agile Data Warehouse project for SAP Implementation • Health Benefits Provider • I’d worked on multiple Data Warehouse projects with little or no test automation in the past - #1 item from those retrospectives were that we would create automated tests next time for the ETL process
  24. 24. The Project • Operational Data Store and Data Warehouse • SQL Server 2012 - SSIS for ETL process • We also used the CDC functionality - SSRS for reporting with Sharepoint as the report portal • Sharepoint Foundation 2010 • 400+ tables • 2.5 terabytes of data - Just for the initial load • 3640 columns in the Operational Data Store • 2520 columns in the Data Warehouse
  25. 25. The quest
  26. 26. Results • We reviewed several candidates and none of them really fit what we required • We looked at - Junit - Dbunit - FitNesse - Other SQL specific ones • Until….
  27. 27. tSQLt
  28. 28. tSQLt • tSQLt is an Open Source database unit testing framework for Microsoft SQL Server. tSQLt is compatible with SQL Server 2005 (service pack 2 required) and above on all editions. - Currently using it on SQL Server 2012 - No issues
  29. 29. tSQLt • Tests are automatically run within transactions – this keeps tests independent and reduces any cleanup work you need • Tests can be grouped together within a schema – allowing you to organize your tests and use common setup methods - Highly recommended • Output can be generated in plain text or XML – making it easier to integrate with a continuous integration tool • Provides the ability to fake tables and views and create stored procedure spies – allows you to be notified when developers aren’t following your standards!!
  30. 30. tSQLt Assertions • AssertEmptyTable • AssertEquals • AssertEqualsString • AssertEqualsTable • AssertNotEquals • AssertObjectExists • AssertResultSetsHaveSameMetaData • Fail • AssertLike
  31. 31. tSQLt Creation and Execution Commands • NewTestClass - tSQLt.NewTestClass creates a new test class. A test class is simply a schema where the user can create test case procedure and any other related objects. • DropClass - tSQLt.DropClass drops a schema and all objects belonging to it. If the schema does not exist, nothing happens. • RunAll - tSQLt.RunAll executes all tests in all test classes created with tSQLt.NewTestClass in the current database. If the test class schema contains a stored procedure called SetUp, it is executed before calling each test case.
  32. 32. tSQLt Creation and Execution Commands • Run tSQLt.Run is a flexible procedure allowing three different ways of executing test cases: 1. Providing a test class name executes all tests in that test class. If a SetUp stored procedure exists in that test class, then it is executed before each test. 2. Providing a test case name executes that single test. 3. Providing no parameter value executes tSQLt.Run the same way the previous call to tSQLt.Run was made when a parameter was provided.
  33. 33. tSQLt additional commands • ApplyConstraint - We want to be able to test constraints individually. We can use FakeTable to remove all the constraints on a table, and ApplyConstraint to add back in the one which we want to test. - ApplyConstraint in combination with FakeTable allows constraints to be tested in isolation of other constraints on a table. • ApplyTrigger - We want to be able to test triggers individually. We can use FakeTable to remove all the constraints and triggers from a table, and ApplyTrigger to add back in the one which we want to test. - ApplyTrigger in combination with FakeTable allows triggers to be tested in isolation of constraints and other triggers on a table.
  34. 34. tSQLt Additional Commands • FakeTable - We want to keep our test cases focused and do not want to insert data which is irrelevant to the current test. However, table and column constraints can make this difficult. - FakeTable allows tests to be written in isolation of the constraints on a table. FakeTable creates an empty version of the table without the constraints in place of the specified table. Therefore any statements which access the table during the execution of the test case are actually working against the fake table with no constraints. When the test case completes, the original table is put back in place because of the rollback which tSQLt performs at the end of each test case.
  35. 35. tSQLt Demo • Installation procedures • Simple Test Demonstration • Simple tSQLt Demonstration
  36. 36. tSQLt installation procedures • Download tSQLt from tSQLt. • Unzip the file to a location on your hard drive. • Make sure CLRs are enabled on your development server by running the following sql: - EXEC sp_configure 'clr enabled', 1; - RECONFIGURE; • Execute the Example.sql file from the zip file to create an example database (tSQLt_Example) with tSQLt and test cases.
  37. 37. tSQLt installation • Your database must be set to trustworthy for tSQLt to run. Execute the following script in your development database: - DECLARE @cmd NVARCHAR(MAX); SET @cmd='ALTER DATABASE ' + QUOTENAME(DB_NAME()) + ' SET TRUSTWORTHY ON;'; EXEC(@cmd); • Execute the tSQLt.class.sql script (included in the zip file) in your development database.
  38. 38. Test Demonstration
  39. 39. Test Demonstration • Old School Testing - Open Enterprise Manager - Open /SDEC12/SDEC solution - Open solution explorer - 1 testStatus1.sql - 2 testStatusAll.sql • Boring…
  40. 40. Test Demonstration • testTable - Traditional test • testTableNew - tSQLt test • AssertEqualsTable test success • Insert extra record • AssetEqualsTable test failure
  41. 41. tSQLt Demo • 4 tSQLt-1.sql
  42. 42. SQLTest
  43. 43. SQLTest • Developed by Red Gate • SQL Test is a unit test add-in for SQL Server Management Studio • Part of the SQL Developer Bundle • $369
  44. 44. SQLTest Demo • Open SQL Test • Test Class TstContainer_sdec12 • TestCase1 – count rows • TestCase2 – Fake Table and Assert Empty Table • TestCase3 – Fake Table and Assert Empty Table - Error • TestCase4 - AssertLike • Run All
  45. 45. SQLTest Additional Features • SQLCop Tests - Procedures with SP_ - VarChar Size Problems - Decimal Size Problem - Undocumented Procedures - Procedures without SET NOCOUNT ON - Procedures with SET ROWCOUNT - Procedures with @@Identity - Procedures with dynamic sql - Procedures using dynamic sql without sp_executesql
  46. 46. SQLTest Additional Features • Template for adding tests • Easier graphical interface to enter and review tests • One click to run tests - Individually - Schema - Database • Click through on TestCases to debug and modify
  47. 47. tSQLtclr Weird Error
  48. 48. tSQLtclr Weird Error • It is possible that when you restore a database from a different server you can get this issue when trying to load the tSQLtclr assembly • KB link : http://support.microsoft.com/kb/918040 • Easiest solution is to ensure sa is the database owner on the database running the tSQLt tests - EXEC sp_changedbowner 'sa'
  49. 49. ETL Data Driven Test Framework
  50. 50. ETL Test Categories - Data completeness - validates that all is loaded correctly. This involves a field by field comparison of expected versus actual results. - Data transformation – validates that all data is transformed correctly according to specifications. - Data quality - validates that the ETL application correctly corrects, ignores, or rejects critical data errors, substitutes default values, and reports invalid data. - Process Integrity – validates that the ETL process correctly executes, reports data errors, generates log messages, and is able to restart and recover from an abnormal termination correctly. - Test Automation – validates that the ETL testing process can be re-run at any time in an automated fashion to validate that the process is operating as required.
  51. 51. ETL Testing Additional Value - Performance and scalability - validates that the ETL process performs within expected time frames and that the technical architecture is scalable and extendable. - Integration testing - validates that the ETL process functions well within the entire environment. Validates that other applications can interact with the ETL process and databases as required. - Regression testing – validates that the automated tests can be utilized to confirm existing functionality continues to operate as required for each new release.
  52. 52. ETL Test Objectives • We need an Agile Test Framework so it is easy to adapt to changes • We can’t afford for the Test framework to take material time to manage on top on Data Models and Databases - Data Driven • Although SQLTest is good, you can imagine how long it would take to add tests for every field - Not feasible - Would also create a maintenance nightmare as the schema changes • So what is a DBA to do?
  53. 53. ETL Test Objectives • To create an automated ETL test framework that is Data Driven by - NEGATIVE TESTING - Information_schema metadata in SQL Server to loop through the tables and fields and execute health checks • Table level • Have we missed tables or columns in our mapping - POSITIVE TESTING - By Data Mapping metadata in SQL Server to loop through all fields and validate the ETL process • Column level • Have we mapped tables and columns correctly
  54. 54. ETL Smells
  55. 55. ETL Smells • Larger smells that indicate problems in the ETL - Empty tables - Empty columns - Columns with all Nulls - Columns with only 1 value - Numeric columns with all zeros • Usually detected by negative testing
  56. 56. Negative Tests • TstTableCount: Compares record counts between source data and target data. This will be done on table to table basis. This will be done starting from the target table and comparing to the associated source table or tables. • TstTableColumnDistinct: Compares counts on distinct values of columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields. • TstTableColumnNull: Generates a report of all columns where all the contents of a field is all null. This typically can highlight situations where the column was not assigned in the ETL process.
  57. 57. ETL Aromas
  58. 58. Positive Tests • TstColumnDataMapping: Compares columns directly assigned from a source column on a field by field basis for specified rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation. • TstColumnConstantMapping: Compares columns directly assigned from a constant on a field by field basis for specified rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  59. 59. Positive Tests • TstColumnFKDataMapping: Compares whether transformed columns in source and target are both not null. This is another key indicator that Foreign Key Values for related tables were unable to be located in the target database • TstInvalidColumnDataMapping: Tests that an Invalid value results in the value being assigned a default value or Null. This record will be added to the staging table to test the process.
  60. 60. Advanced Negative Tests – not yet implemented • TstInvalidParentFKColumn: Tests that an Invalid Parent FK value results in the records being logged and bypassed. This record will be added to the staging table to test the process. • TstInvalidFKColumn: Tests that an Invalid FK value results in the value being assigned a default value or Null. This record will be added to the staging table to test the process.
  61. 61. Data Driven
  62. 62. Data Mapping Framework Design • We can create two tables that when used in conjunction with Information_schema can generate all negative tests and simple positive tests - table_data_mapping - column_data_mapping • We can then create two more tables that can be used to generate more complicated positive tests - etl_test_case - etl_test_case_transform • The database also requires a cross_reference table that converts old PK values to new PK values
  63. 63. ETL Test Framework Data Model
  64. 64. table_data_mapping Column Name Description Table Data Mapping ID Surrogate Key Source Database Database that the source data exists in Source Table Table that the source data exists in Source Primary Key Source Primary Key Column Name Target Database Database that the data will be loaded into Target Table Table that the data will be loaded into Target Primary Key Primary Key Column Name Target Table Type Type of Table. Valid Values are: “Main” – Main Table “Reference” – Reference Table Purpose : Defines Table to Table ETL Mapping
  65. 65. column_data_mapping Column Name Description Column Data Mapping ID Surrogate Key Table Data Mapping ID ID of the related Table Data Mapping record Source Column Column that the source data exists in Source Constant Constant value that is to be assigned/expected in the target column Target Column Column that the data will be loaded into Data Mapping Type Formula for column assign. Typical examples would be: “=” – Assignment “Constant” – Literal Assignment “Transform” – Column Transformation “Generated” – Column Generated Purpose : Defines Column to Column ETL Mapping
  66. 66. Data Mapping Complex Framework Design • We need two additional tables that can be used to generate the more complex test cases • These tables are used to stored primary key values to select which rows we will use as test cases - We can then choose to test 1 row or 1,000 rows
  67. 67. etl_test_case Column Name Description ETL Test Case ID Surrogate Key Table Data Mapping ID ID of the related Table Data Mapping record Target Primary Key Value Target Primary Key Value for the test Successful Test Ind Indicate if the test was successful Purpose : Defines row to row ETL Mapping for a test case
  68. 68. etl_test_case_transform Column Name Description ETL Test Case Transform ID Surrogate Key ETL Test Case ID ID of the related Test Case record Target Column Target Column name Expected value Expected value in target column Successful Test Ind Indicate if the test was successful Purpose : Defines expected target column value for each test case defined in etl_test_case
  69. 69. One more table… cross_reference Column Name Description cross_reference_id Surrogate key source_system_code Source system name source_system_table Table row originated from in source system source_system_val Primary Key value from source system target_table_name Table row was stored to in target system target_id Primary key value from target system updated_date Date row was updated modified_by User that modified the row deleted_ind Indicated if the row is deleted
  70. 70. Data Mapping table in Excel that is used to populate tables Column Name Description Target_db Target database that target column exists in Target_table Target table that target column exists in Target_table_type Target table type. Values are Main or Reference Target_column Target column that is being loaded by the ETL process Source_db Source database that source column is being extracted from Source_table Source table that source column is being extracted from Source_column Source column that is being extracted Source_PK Primary Key of the source table that the source column exists in. This field will be used in conjunction with the cross_reference table to determine the target_pk value so equivalent records can be compared between the source and target databases Mapping_type Type of data mapping for the column. The values are:  Constant/”=“/Not Populated/Transform/Generated Data_type Data type of the target column. Valid values are:  Integer/bigint/nvarchar/datetime/indicator Nullable_ind Indicates whether the target column is nullable Invalid_test Contains the values that can be tested to cause exceptions
  71. 71. The Goods
  72. 72. TstTableCount Demo – 1 minute • TstColumnCountLoop • TstColumnCount
  73. 73. TstTableColumnNull Demo – 15 minutes • TstColumnColumnNullLoop • TstColumnColumnNull
  74. 74. TstTableColumnDistinct Demo – 20 minutes • TstColumnColumnDistinctLoop • TstColumnColumnDistinct • Set ansi_warnings off
  75. 75. TstColumnDataMapping Demo – 1 minute • TstColumnDataMappingLoop • TstColumnDataMapping
  76. 76. TstColumnConstantMapping Demo – 1 minute • TstColumnConstantMappingLoop • TstColumnConstantMapping
  77. 77. Questions?