Data Warehouse Testing


Increasingly, businesses are focusing on the collection and organization of data for strategic decision making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage. SQA Solution provides practical recommendations for testing extract, transform, and load (ETL) applications based on years of experience testing data warehouses in the financial services and consumer retailing areas.

Published in: Technology


[Figure: A conceptual diagram for ETL and data warehouse testing.]

The cost of a software defect escalates significantly the later in the development lifecycle it is discovered. In data warehousing, this is compounded by the added cost of making important business decisions based on incorrect data. Given the importance of detecting defects early, here are some general goals of testing an ETL application:

  • Data completeness. Ensures that all expected data is loaded.
  • Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
  • Data quality. Ensures that the ETL software correctly rejects, substitutes default values for, corrects, or ignores invalid data, and reports it.
  • Scalability and performance. Ensures that data loads and queries execute within anticipated time frames and that the technical design is scalable.
  • Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
  • User-acceptance testing. Ensures that the solution satisfies your current expectations and anticipates your future expectations.
  • Regression testing. Ensures that existing functionality stays intact whenever new code is released.

Data Completeness

One of the most basic tests of data completeness is to verify that all data loads correctly into the data warehouse. This includes validating that all records, fields, and the full contents of each field are loaded. Strategies to consider include:

  • Comparing record counts between source data, data loaded to the warehouse, and rejected records.
  • Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • Utilizing a data profiling tool that shows the range and value distributions of fields in a data set.
    This can be employed during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
  • Populating the entire contents of every field to verify that no truncation takes place during any step in the process. For example, if the source data field is a string(30), ensure it is tested with 30 characters.
  • Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts may be too small.

Data Transformation

Validating that data is transformed correctly according to business rules is the most complex part of testing an ETL application with considerable transformation logic. One technique is to select several sample records and “stare and compare” to verify data transformations manually. This is often useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:

  • Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is an excellent requirements elicitation step during design and can also be used as part of testing.
  • Create test data that includes all scenarios. Have an ETL developer automate the entire process of populating data sets from the scenario spreadsheet, which allows flexibility because scenarios are likely to change.
  • Use data profiling results to compare the range and distribution of values in each field between target and source data.
  • Validate accurate processing of ETL-generated fields; for example, surrogate keys.
  • Validate that the data types within the warehouse are the same as specified in the data model or design.
  • Create data scenarios between tables that test referential integrity.
  • Validate parent-to-child relationships in the data. Create data scenarios that test the handling of orphaned child records.

Data Quality

SQA Solution defines data quality as “how the ETL system deals with data rejection, replacement, correction, and notification without changing any of the data.” To test data quality successfully, we incorporate many data scenarios. Typically, data quality rules are defined during design, for example:

  • Reject the record if a certain decimal field has nonnumeric data.
  • Substitute null if a certain decimal field has nonnumeric data.
  • Validate and correct the state field if necessary based on the ZIP code.
  • Compare the product code to values in a lookup table. If there is no match, load the record anyway, but report this to our clients.

Depending on the data quality rules of the software we are testing, specific scenarios to test could involve duplicate records, null key values, or invalid data types. Review the detailed test scenarios with business clients and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what’s loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the clients. Our data quality reports provide useful information that in some cases uncovers systematic issues with the source data itself. At times, it may be beneficial to populate the “before” data in the database for clients to view.
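
As a minimal sketch of how the example data quality rules in this section might be exercised in an automated test, the Python below applies them to a few scenario records and collects a report. The field names (amount, discount, zip, state, product_code), the two lookup tables, and the function names are hypothetical and chosen only for illustration; they are not taken from any SQA Solution tooling, and a real ETL tool would express these rules in its own way.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical reference data, for illustration only.
ZIP_TO_STATE = {"10001": "NY", "94105": "CA"}
KNOWN_PRODUCT_CODES = {"P100", "P200"}


def parse_decimal(text):
    """Return a finite Decimal, or None if the text is not numeric."""
    try:
        value = Decimal(text)
    except (InvalidOperation, TypeError, ValueError):
        return None
    return value if value.is_finite() else None


def apply_quality_rules(record):
    """Apply the example rules to one source record.

    Returns (record_to_load_or_None, report_messages).
    The source record itself is never modified.
    """
    issues = []
    out = dict(record)  # work on a copy so the source data is unchanged

    # Rule: reject the record if 'amount' is nonnumeric.
    amount = parse_decimal(record.get("amount"))
    if amount is None:
        return None, ["rejected: nonnumeric amount"]
    out["amount"] = amount

    # Rule: substitute null if 'discount' is nonnumeric.
    discount = parse_decimal(record.get("discount"))
    if discount is None:
        issues.append("discount replaced with null")
    out["discount"] = discount

    # Rule: validate and correct the state based on the ZIP code.
    expected_state = ZIP_TO_STATE.get(record.get("zip"))
    if expected_state is not None and record.get("state") != expected_state:
        issues.append(f"state corrected to {expected_state}")
        out["state"] = expected_state

    # Rule: unknown product codes are loaded anyway but reported.
    if record.get("product_code") not in KNOWN_PRODUCT_CODES:
        issues.append("unknown product code loaded")

    return out, issues


def run_load(source_records):
    """Run every record through the rules; return (loaded, rejected, report)."""
    loaded, rejected, report = [], [], []
    for row_number, record in enumerate(source_records, start=1):
        result, issues = apply_quality_rules(record)
        if result is None:
            rejected.append(record)
        else:
            loaded.append(result)
        report.extend((row_number, issue) for issue in issues)
    return loaded, rejected, report


# Scenario data: one clean record, one to reject, one needing fixes.
SCENARIOS = [
    {"amount": "19.99", "discount": "1.00", "zip": "10001",
     "state": "NY", "product_code": "P100"},
    {"amount": "abc", "discount": "0", "zip": "10001",
     "state": "NY", "product_code": "P100"},
    {"amount": "5", "discount": "n/a", "zip": "94105",
     "state": "NY", "product_code": "P999"},
]

loaded, rejected, report = run_load(SCENARIOS)
```

A test built on this sketch can assert both the outcome of each rule and that every source record is accounted for, that is, that the loaded and rejected counts add up to the source count. The report list plays the role of the data quality report: each entry records the source row number and what was done with the invalid data.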

Scalability and Performance

As the amount of data in a data warehouse increases, ETL load times can be expected to increase and query performance can be expected to decline. This can be mitigated by a sound technical architecture and good ETL design. The goal of performance testing is to uncover any potential problems in the ETL design. The following strategies will help discover performance issues:

  • Load the database with maximum anticipated production volumes to make certain this amount of data can be loaded by the ETL process in the agreed-upon timeframe.
  • Compare these ETL loading times to loading times with a reduced amount of data to anticipate possible issues with scalability.
  • Compare the ETL processing times component by component to pinpoint any areas of weakness.
  • Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
  • Perform simple and multiple-join queries to validate query performance on large database volumes. Work together with business clients to formulate test queries and overall performance requirements for every query.

Integration Testing

Typically, system testing only includes testing within the ETL application. The input and output of the ETL code constitute the endpoints of the system being tested. Integration testing demonstrates how the software fits into the overall flow of all upstream and downstream applications.

When designing integration test scenarios, we take into account how the overall process could break. We therefore focus on touch points between applications instead of within a single application. We take into account how process breakdowns at each step would be managed and how data would be restored or deleted if required.

Most difficulties discovered in the course of integration testing result from incorrect assumptions about the design of another application.
Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment.

As always, don’t forget the importance of good communication between the testing and design teams of all systems involved. To bridge this communication gap, it’s a good idea to bring team members from all systems together to help create test scenarios and talk about what might go wrong in production. Perform the complete process from start to finish in the exact same order and use the same dependencies, just as you would in production. Ideally, integration testing is a combined effort and not the sole responsibility of the team testing the ETL application.

Note: Want to learn even more about how we can help out with your testing strategy? Visit our other site,, to get detailed information about what’s included in our services.