Creating a Data validation and Testing Strategy


Creating A Data Validation & Testing Strategy

Are you struggling with formulating a strategy for how to validate the massive amount of data continuously entering your data warehouse or data lake?

We can help you!

Learn how RTTS’ Data Validation Assessment provides:
- an evaluation of your current data validation process
- recommendations on how to improve your process and
- a proposal for successful implementation

This slide deck addresses the following issues:
- How do I find out if I have bad data?
- How do I ensure I am testing the proper data permutations?
- How much of my data needs to be validated and automated?
- Which critical data endpoints need to be tested?
- How do I test data in my cloud environments?

And much more!

For more information, visit:

Published in: Technology

  1. 1. Webinar: Creating a Data Validation & Testing Strategy. Mike Calabrese, Team Lead/Senior Engineer; Bill Hayduk, Founder/CEO
  2. 2. Copyright Real-Time Technology Solutions, Inc. 2019 CONFIDENTIAL – DO NOT distribute
  3. 3. Facts
      • Founded: 1996 (24th anniversary)
      • Location: New York City (HQ)
      • Customer profile: Fortune 500 & mid-size; 700+ customers
      • Strategic Partners: IBM, Microsoft, Oracle, Teradata, Cloudera, HortonWorks, MongoDB, SAP, Micro Focus
      • Other Software Supported: QuerySurge, Selenium, Appium, CitraTest, Postman, Smart Bear, JMeter, others
      RTTS is the premier pure-play QA & Testing firm that specializes in Test Automation
  4. 4. Data Validation Assessment by RTTS. Sections: Data Validation, Data Testing Strategies, Intro, Assessment, Case Study
  5. 5. Data Validation Assessment by RTTS. Sections: Data Validation, Data Testing Strategies, Intro, Assessment, Case Study
  6. 6. Big Impacts of Big Data
      • Handles more than 1 million customer transactions every hour.
        o data imported into databases that contain > 2.5 petabytes of data
        o the equivalent of 167 times the information contained in all the books in the US Library of Congress.
      • Facebook handles 40 billion photos from its user base.
      • Google processes 1 Terabyte per hour
      • Twitter processes 85 million tweets per day
      • eBay processes 80 Terabytes per day
      • others
  7. 7. DWH, BI, Big Data Marketplaces
      • Data Warehouse Marketplace: “the worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2020” - Forrester. Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
      • Business Intelligence Marketplace: “The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020” - Gartner. Vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders
      • Big Data Marketplace: “By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services” - IDC. Vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, HortonWorks, Cloudera, Teradata, SAP, MongoDB, MapR, DataStax, Snowflake.
  8. 8. Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart → Business Intelligence (BI) & Analytics
  9. 9. Impacts of Bad Data
      • “On average, poor data quality costs organizations $14.2 million annually.”
      • “Dirty data costs the average business 15% to 25% of revenue.”
      • “Cleaning up data will lead to average cost savings of 33%, while boosting revenue by an average of 31%.”
  10. 10. Data Validation Assessment by RTTS. Sections: Data Validation, Data Testing Strategies, Intro, Assessment, Case Study
  11. 11. What is Data Validation?
      Data Validation Testing: The process of verifying that your data is completely and accurately moved through your systems according to the business requirements.
      Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH
  12. 12. Data Validation Testing: DATA VALIDATION TEST TYPES
      • Data Completeness: Verifying that all data has been loaded from the sources to the target Data Warehouse. Validate that the correct data displays in BI reports.
      • Data Transformation: Ensuring that all data has been transformed correctly during the extract-transform-load (ETL) process.
      • Data Quality: Ensuring that the ETL process correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
      • BI Report Testing: Verify that BI Reports are formatted correctly, calculated fields are validated, and data is verified against the underlying data.
      • BI Performance Testing: Ensure your BI Reports can be generated in a reasonable amount of time.
  13. 13. Finding Bad Data (Issue: Description. Possible causes.)
      • Missing Data: Data that does not make it into the target database. Possible causes: invalid or incorrect lookup table in the transformation logic; bad data from the source database (needs cleansing); invalid joins.
      • Truncation of Data: Data being lost by truncation of the data field. Possible causes: invalid field lengths on target database; transformation logic not considering field lengths from source.
      • Data Type Mismatch: Data types not set up correctly on target database. Possible cause: source data field not configured correctly.
      • Null Translation: Null source values not being transformed to correct target values. Possible cause: development team did not include the null translation in the transformation logic.
      • Wrong Translation: Opposite of the Null Translation error. Field should be null but is populated with a non-null value, or field should be populated but holds the wrong value. Possible cause: development team incorrectly translated the source field for certain values.
      • Misplaced Data: Source data fields not being transformed to the correct target data field. Possible cause: development team inadvertently mapped the source data field to the wrong target data field.
      • Extra Records: Records which should not be in the ETL are included in the ETL. Possible cause: development team did not include a filter in their code.
      • Not Enough Records: Records which should be in the ETL are not included in the ETL. Possible cause: development team had a filter in their code which should not have been there.
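Two of the issues on this slide, truncation and null translation, lend themselves to simple automated checks. The following is a minimal sketch only; the function names, the 10-character target length, and the "UNKNOWN" default are hypothetical and do not come from the deck.

```python
def truncation_risks(source_values, target_length):
    """Return source values longer than the target field allows (hypothetical check)."""
    return [v for v in source_values if v is not None and len(v) > target_length]


def null_translation_errors(pairs, expected_default):
    """Return (source, target) pairs where a null source was not translated
    to the expected default value in the target."""
    return [(s, t) for s, t in pairs if s is None and t != expected_default]


# Assumed sample data: a city column loaded into a 10-character target field.
cities = ["Rome", "San Francisco", None]
risky = truncation_risks(cities, 10)

# Assumed rule: null source values must become "UNKNOWN" in the target.
pairs = [(None, "UNKNOWN"), (None, None), ("NY", "NY")]
bad = null_translation_errors(pairs, "UNKNOWN")

print(risky)
print(bad)
```

In practice the source values and the (source, target) pairs would come from queries against the real systems; the lists here only stand in for those result sets.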
  14. 14. Finding Bad Data (cont.) (Issue: Description. Possible causes.)
      • Transformation Logic Errors/Holes: Testing sometimes can lead to finding “holes” in the transformation logic or realizing the logic is unclear. Possible cause: development team did not take into account special cases. For example, international cities that contain special language-specific characters might need to be dealt with in the ETL code.
      • Simple/Small Errors: Capitalization, spacing and other small errors. Possible cause: development team did not add an additional space after a comma for populating the target field.
      • Sequence Generator: Ensuring that the sequence numbers of reports are in the correct order is very important when processing follow-up reports or responding to an audit. Possible cause: development team did not configure the sequence generator correctly, resulting in records with a duplicate sequence number.
      • Undocumented Requirements: Find requirements that are “understood” but are not actually documented anywhere. Possible cause: several members of the development team did not understand the “understood” undocumented requirements.
      • Duplicate Records: Duplicate records are two or more records that contain the same data. Possible cause: development team did not add the appropriate code to filter out duplicate records.
      • Numeric Field Precision: Numbers that are not formatted to the correct decimal point or not rounded per specifications. Possible cause: development team rounded the numbers to the wrong decimal point.
      • Rejected Rows: Data rows that get rejected due to data issues. Possible cause: development team did not take into account data conditions that could break the ETL for a particular row.
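Duplicate records, duplicate sequence numbers, and numeric precision can each be checked with a few lines of code. This is a rough sketch under assumed inputs: rows are represented as tuples and numeric values as strings, and the helper names are hypothetical rather than part of any tool named in the deck.

```python
from collections import Counter
from decimal import Decimal


def find_duplicates(items):
    """Return a dict of item -> occurrence count for items seen more than once.
    Works for whole rows (as tuples) or for sequence numbers."""
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n > 1}


def wrong_precision(values, places):
    """Return the string values whose number of decimal places differs from `places`."""
    return [v for v in values if -Decimal(v).as_tuple().exponent != places]


# Assumed sample target rows: (sequence number, amount as loaded).
rows = [(1, "10.50"), (2, "10.5"), (1, "10.50"), (3, "7")]

duplicate_rows = find_duplicates(rows)
duplicate_sequence_numbers = find_duplicates(seq for seq, _ in rows)
bad_amounts = wrong_precision([amount for _, amount in rows], places=2)

print(duplicate_rows)
print(duplicate_sequence_numbers)
print(bad_amounts)
```

Reading the amounts as strings matters here: converting them to floats first would discard the trailing zeros that the precision check depends on.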
  15. 15. Challenges
      • How much data needs to be validated/tested?
      • How do I ensure I am testing the proper data permutations?
      • What are the critical data endpoints that need to be tested?
      • How do I verify that the data from my various source systems is propagating through the architecture?
      • How do I validate data in the cloud environments?
      • Is bad data making it into the architecture?
      • How much of the data testing can be automated?
  16. 16. Solutions: Finding Bad Data
      [Chart: COST across project phases: Data Mapping, Development, Unit Testing, QA Test Cycle, UAT Testing, End User]
      • Identify testing points
      • Review data mappings
      • Data Testing Strategies
      • comparisons (source vs. target)
      • row counts
      • minus queries
      • automation tools
  17. 17. Solutions
      Data Testing Permutations
      • Analyze the data mappings
      • Develop a test Data Set
        o Review Transformation Logic
          ▪ Case Statements
          ▪ Field Merges/Field Splitting
          ▪ Translations (Lookups)
          ▪ Derived
      Test Data Generation
      • Replication of production data
      • Homegrown or Freeware
      • Enterprise solutions
        o IBM InfoSphere Optim, GenRocket, SAP, Computer Associates
  18. 18. Solutions: How much data to validate?
      • Requirements
        o Regulatory authorities may require 100% of your data be tested.
        o In other cases, 90% or 80% may be the goal.
      • Time, resource and scope driven
        o Release timeline
        o Available resources
        o Scope of authoring and executing tests
      • Risk Assessment
        o Business Acceptance Criteria – End users define their primary data use cases.
        o Critical Path – Validate the data that flows through the high priority data endpoints within your system.
      Test authoring time total / (# of resources * (# of hours per day authoring per resource)) = # of days
      Test execution time total / (# of resources * (# of hours per day executing per resource)) = # of days
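The two estimates on this slide divide total effort in hours by the team's daily capacity. A small worked example follows; all of the figures (240 authoring hours, 60 execution hours, 3 people) are made up for illustration and are not from the deck.

```python
def days_needed(total_hours, resources, hours_per_day_per_resource):
    """Total effort divided by daily team capacity, per the slide's formula."""
    if resources <= 0 or hours_per_day_per_resource <= 0:
        raise ValueError("resources and hours per day must be positive")
    return total_hours / (resources * hours_per_day_per_resource)


# Hypothetical inputs: 3 testers authoring 4 hours a day, executing 2 hours a day.
authoring_days = days_needed(total_hours=240, resources=3, hours_per_day_per_resource=4)
execution_days = days_needed(total_hours=60, resources=3, hours_per_day_per_resource=2)

print(authoring_days)  # 240 / (3 * 4) = 20.0
print(execution_days)  # 60 / (3 * 2) = 10.0
```

Comparing the resulting day counts against the release timeline gives a quick indication of whether the planned scope fits or needs to be reduced.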
  19. 19. Solutions: Automation vs Manual
      • Recurrence
        o Avoid complicated single-use test cases
        o Focus on repeatable testing paths
        o Ensure modularization of test data sets
      • Test Data Sets
        o Consider the automation tool’s assigned hardware resources and performance, which must be able to handle the load of the data set under test
        o Include time needed to prepare environments in your testing estimates
      • Database Performance
        o Set expectations on database hardware & responsiveness.
        o SQL query response time will factor into overall test run times
  20. 20. Solutions: How do I test data in my cloud environment?
      • On-Prem vs Cloud
        o Follow the same testing methodologies but with considerations for cloud connections and scalability
        o If an automated solution is being pursued, confirm the tools involved allow for connectivity to your cloud environment
      • Hybrid-Cloud Mapping
        o Interface documentation
        o Define entry & exit points if applicable
      • Digital Transformation
        o Clearly defined conversion requirements and mappings
      • Environment Scalability
        o Define limitations on testing environment resources
  21. 21. Data Validation Assessment by RTTS. Sections: Data Validation, Data Testing Strategies, Intro, Assessment, Case Study
  22. 22. Data Validation Assessment: What are the goals of a Data Validation assessment?
      • Receive an expert evaluation of your current data validation process
      • Receive recommendations on how to improve your process
      • Receive a proposal for successful implementation of your goals
  23. 23. Data Validation Assessment: Components of the Assessment
      • Business analysis
      • Data architecture analysis
      • ETL testing process evaluation
      • DataOps & DevOps evaluation
      • Resource evaluation (optional)
      • Metrics evaluation
      • Risk assessment
  24. 24. Data Validation Assessment: Interview with Key Players
      • Business/Data Analysts create requirements
      • QA Testers develop and execute test plans and test cases
      • Architects set up environments
      • Developers create ETL code, perform unit tests
      • DBAs test for performance and stress
      • Business Users perform functional User Acceptance Tests
  25. 25. Data Validation Assessment: Process Review
      • Review Requirements & Mapping documentation
      • Testing Process Design
      • Analysis of tools and DevOps/DataOps
      • Reporting metrics evaluations
  26. 26. Data Validation Assessment: Deliverables
      • Detailed analysis report with recommendations for improvement
      • Presentation to your team on our findings
      • Proposal for successful implementation of your goals
  27. 27. Data Validation Assessment by RTTS. Sections: Data Validation, Data Testing Strategies, Intro, Assessment, Case Study
  28. 28. Data Testing - Developer & Tester
      [Diagram elements: Source Data, Big Data lake, ETL, Data Warehouse, ETL, Data Mart, BI & Analytics; Testing Points #1, #2, #3 and #4]
      • ETL Developer: Codes data movement based on Mapping Requirements
      • Data Tester: Tests data movement based on Mapping Requirements
      • BI Analyst extracts data for reports
      • Tester tests BI Reports
  29. 29. Data Requirements = Mapping Document
      Source-to-Target Map: It’s the critical element required to efficiently plan the target Data Stores. It also defines the Extract, Transform, Load (ETL) process.
      Intention:
      ✓ capture business rules
      ✓ data flow mapping and
      ✓ data movement requirements.
      Mapping Doc specifies:
      ▪ Source input definition
      ▪ Target/output details
      ▪ Business & data transformation rules
      ▪ Absolute data quality requirements
      ▪ Optional data quality requirements.
  30. 30. Data Testing Strategies: Testing Methods
      • Minus Queries – Create a SQL source query and a SQL target query. Using SQL, subtract the source query results from the target query results, and subtract the target query results from the source query results.
      • Visual Compare – View source data and target data and manually compare.
      • Record Counts – Create SQL source and target queries that return record counts, and compare the values.
      • Automation – Use an automation tool to compare SQL source and target query results.
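The minus-query and record-count methods can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the tables src_customer and tgt_customer and their rows are invented for illustration, and SQLite spells the set-difference operator EXCEPT (Oracle uses MINUS).

```python
import sqlite3


def minus_both_ways(conn, source_query, target_query):
    """Run the minus query in both directions.
    Returns (rows only in source, rows only in target), each sorted."""
    only_in_source = conn.execute(f"{source_query} EXCEPT {target_query}").fetchall()
    only_in_target = conn.execute(f"{target_query} EXCEPT {source_query}").fetchall()
    return sorted(only_in_source), sorted(only_in_target)


def record_counts(conn, source_table, target_table):
    """Return (source row count, target row count)."""
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src, tgt


# Hypothetical source and target tables held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, name TEXT);
    CREATE TABLE tgt_customer (id INTEGER, name TEXT);
    INSERT INTO src_customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO tgt_customer VALUES (1, 'Ann'), (2, 'BOB'), (4, 'Di');
""")

only_src, only_tgt = minus_both_ways(
    conn,
    "SELECT id, name FROM src_customer",
    "SELECT id, name FROM tgt_customer",
)
counts = record_counts(conn, "src_customer", "tgt_customer")

print(only_src)  # rows missing from, or different in, the target
print(only_tgt)  # rows extra in, or different in, the target
print(counts)
```

In this sample the record counts match (3 and 3) even though two rows differ, which is why record counts alone are a weak check. Note also that EXCEPT compares distinct rows, so it will not reveal duplicate records on its own.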
  31. 31. Data Maturity Model - Test Execution
      • Level 1, Sampling: Sampling a % of data by visually comparing data sets. Not repeatable.
      • Level 2, Excel, Ad Hoc Reporting: Using Excel or other homegrown method. Ad hoc reporting.
      • Level 3, Minus Queries: Utilizing SQL editor & minus queries to test data. More detailed reporting.
      • Level 4, Data Test Automation: Repeatable test automation, agreed-upon process, centralized reporting.
      • Level 5, Data Quality Optimizing: Full automation, tracking of ROI, predictive data issues, auditable results. Business value is fully understood/supported by management.
      On which Level should your process be?
  32. 32. Data Validation Assessment by RTTS. Sections: Data Validation, Data Testing Strategies, Intro, Assessment, Case Study
  33. 33. Case Study: CASE STUDY OVERVIEW
      A company in the financial industry had a development and QA team assigned to their ETL process. But there were still issues:
      • They were still suffering from incorrect data fields populating their Business Intelligence (BI) reports
      • Development cycles were frequently delayed
      • Management was losing confidence in the BI reporting data
  34. 34. Case Study
      Resource needs: Senior RTTS resources were brought in to assess the process
      • Interview key players
      • Review process documentation and tools
      Problem areas identified:
      • Minimal requirements
      • Ticketing system was not being implemented for traceability
      • Testing process of low-level maturity
        o Table row counts
        o Sampling
        o Excel comparisons
  35. 35. Case Study: Recommendations for Improvement
      • Centralized mapping documentation
        o Linking requirements to work item tickets to test cases.
      • Improve communications between team members; we recommended a new Data Analyst role
      • Narrowed focus of the stand-up meetings
      • Implemented automated solutions to expand coverage for larger data sets
  36. 36. DEMO: Automating your data validation & testing
  37. 37. Any questions? Creating a Data Validation & Testing Strategy