Bad Data - Common Big Data & Data Warehouse Defects

The QuerySurge™ team has put together a collection of the common defects that QuerySurge™ typically finds in data warehouse projects. These defects can cause your data warehouse and, ultimately, your Business Intelligence reports to contain bad data.

According to Gartner, bad data costs companies an average of $8.2 million annually; 22% of companies surveyed estimated their annual losses from bad data at $20 million, and 4% put that figure as high as an astounding $100 million.

InformationWeek states that 46% of companies cite Data Quality as a barrier to adopting Business Intelligence products.

Data quality is becoming increasingly important as data volumes grow exponentially. Most companies use Business Intelligence (BI) to make strategic decisions in the hope of gaining a competitive advantage in a tough business landscape, but Bad Data will lead them to decisions that cost their firms millions of dollars.

QuerySurge finds bad data in Big Data.

For more information, visit www.QuerySurge.com


Common Defects in Big Data & Data Warehouses

Below is a description of the common defects that QuerySurge typically finds in Big Data and Data Warehouse projects. These defects will cause your project and, ultimately, your Business Intelligence and Analytics reports to contain bad data. Since C-level executives base their strategic decisions on this data, these defects can cost a firm millions of dollars. According to Gartner, bad data costs companies $8.2 million annually.

Missing Data
Description: Data that does not make it into the target database.
Possible Causes: An invalid or incorrect lookup table in the transformation logic; bad data from the source database (needs cleansing); invalid joins.
Example: The lookup table should contain the field value "High", which maps to "Critical". However, the source data field contains "Hig" (missing the "h"), which fails the lookup and leaves the target data field null. If this occurs on a key field, a join could be missed and the entire row could fall out.

Truncation of Data
Description: Data lost through truncation of the data field.
Possible Causes: Invalid field lengths on the target database; transformation logic that does not take the source field lengths into account.
Example: The source field value "New Mexico City" is truncated to "New Mexico C" because the target data field is not long enough to hold the entire value.

Data Type Mismatch
Description: Data types not set up correctly on the target database.
Possible Causes: The data field was not configured correctly.
Example: A data field was required to be a date; however, when initially configured, it was set up as a VarChar.

Null Translation
Description: Null source values not transformed to the correct target values.
Possible Causes: The development team did not include the null translation in the transformation logic.
Example: A null source data field was supposed to be transformed to 'None' in the target data field. However, the logic was not implemented, so the target data field contains null values.

Wrong Translation
Description: The opposite of the Null Translation error: a field that should be null is populated with a non-null value, or a field is populated with the wrong value.
Possible Causes: The development team incorrectly translated the source field for certain values.
Examples: 1) The target field should be populated only when the source field contains certain values and should otherwise be set to null. 2) The target field should be "Odd" when the source value is an odd number, but it contains "Even" instead. (This is a very basic example.)

Misplaced Data
Description: Source data fields not transformed to the correct target data field.
Possible Causes: The development team inadvertently mapped the source data field to the wrong target data field.
Example: A source data field was supposed to be transformed to the target data field 'Last_Name', but the development team inadvertently mapped it to 'First_Name'.

Extra Records
Description: Records that should not be in the ETL are included in the ETL.
Possible Causes: The development team did not include a filter in their code.
Example: If a case has the deleted field populated, the case and any data related to it should not be in any ETL.

Not Enough Records
Description: Records that should be in the ETL are not included in the ETL.
Possible Causes: The development team had a filter in their code that should not have been there.
Example: If a case was in a certain state, it should be ETL'd over to the data warehouse but not the data mart.
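Several of the defects above (Missing Data, Extra Records, Not Enough Records) are typically caught by reconciling source rows against target rows. The following is a minimal, runnable sketch of that reconciliation pattern using Python's built-in sqlite3 module. The tables, columns, and data (src_case, tgt_case, the deleted flag) are hypothetical stand-ins for a real source database and data warehouse; this illustrates the general technique, not QuerySurge's actual implementation.

```python
import sqlite3

# One in-memory database stands in for both the source system and the
# target warehouse (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE src_case (case_id INTEGER, priority TEXT, deleted INTEGER);
CREATE TABLE tgt_case (case_id INTEGER, priority TEXT);

-- Source rows. Case 2 carries the bad lookup value 'Hig' instead of 'High'.
INSERT INTO src_case VALUES (1, 'High', 0);
INSERT INTO src_case VALUES (2, 'Hig',  0);
INSERT INTO src_case VALUES (3, 'Low',  0);
INSERT INTO src_case VALUES (4, 'Low',  1);  -- deleted: must not be ETL'd

-- Target rows. Case 2 fell out of the load (Missing Data) and the
-- deleted case 4 slipped in (Extra Records).
INSERT INTO tgt_case VALUES (1, 'Critical');
INSERT INTO tgt_case VALUES (3, 'Minor');
INSERT INTO tgt_case VALUES (4, 'Minor');
""")

# Missing Data / Not Enough Records: keys present in the (non-deleted)
# source but absent from the target.
missing = cur.execute("""
    SELECT case_id FROM src_case WHERE deleted = 0
    EXCEPT
    SELECT case_id FROM tgt_case
""").fetchall()

# Extra Records: keys present in the target that should never have loaded.
extra = cur.execute("""
    SELECT case_id FROM tgt_case
    EXCEPT
    SELECT case_id FROM src_case WHERE deleted = 0
""").fetchall()

print("missing from target:", missing)  # -> [(2,)]
print("extra in target:", extra)        # -> [(4,)]
```

The same minus-query pattern extends from keys to full rows: comparing every transformed column, not just the key, also surfaces Misplaced Data and Wrong Translation defects as mismatched column values.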
Transformation Logic Errors/Holes
Description: Testing can reveal "holes" in the transformation logic, or reveal that the logic is unclear.
Possible Causes: The development team did not take special cases into account. For example, international cities that contain language-specific characters might need to be handled explicitly in the ETL code.
Examples: 1) Most cases fall into a certain branch of the transformation logic, but a small subset of cases (sometimes with unusual data) may not fall into any branch. The testers' and the developers' code may handle these cases differently (and both may end up being wrong), and the logic is changed to accommodate the cases. 2) The tester and the developer interpret the transformation logic differently, which results in different values; the logic is then rewritten to make it clearer.

Simple/Small Errors
Description: Capitalization, spacing, and other small errors.
Possible Causes: The development team did not add a space after the comma when populating the target field.
Example: Product names on a case should be separated by a comma and then a space, but the target field separates them with only a comma.

Sequence Generator
Description: Ensuring that the sequence numbers of reports are in the correct order is very important when processing follow-up reports or responding to an audit.
Possible Causes: The development team did not configure the sequence generator correctly, resulting in records with duplicate sequence numbers.
Example: Records with duplicate sequence numbers in the sales report doubled up several sales transactions, which skewed the report significantly.

Undocumented Requirements
Description: Requirements that are "understood" but are not actually documented anywhere.
Possible Causes: Several members of the development team did not understand the "understood" undocumented requirements.
Example: A restriction in the "where" clause limited how certain reports were brought over. It was used in mappings that were understood to be necessary but were not actually in the requirements. Occasionally, it turns out that the understood requirements are not what the business wanted.

Duplicate Records
Description: Two or more records that contain the same data.
Possible Causes: The development team did not add the appropriate code to filter out duplicate records.
Example: Duplicate records in the sales report doubled up several sales transactions, which skewed the report significantly.

Numeric Field Precision
Description: Numbers that are not formatted to the correct decimal point or not rounded per the specifications.
Possible Causes: The development team rounded the numbers to the wrong decimal place.
Example: The sales data did not have the correct precision, and all sales were rounded to the whole dollar.

Rejected Rows
Description: Data rows that get rejected due to data issues.
Possible Causes: The development team did not take into account data conditions that break the ETL for a particular row.
Example: Missing data rows in the sales table caused major issues with the end-of-year sales report.
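The Duplicate Records and Numeric Field Precision defects above lend themselves to simple aggregate checks. Below is a minimal sketch of both, again using Python's built-in sqlite3 module with a hypothetical sales table; the schema, the sample values, and the half-cent tolerance are illustrative assumptions rather than a prescription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sales fact table: transaction 101 was loaded twice
# (Duplicate Records) and transaction 103 was rounded to the whole
# dollar (Numeric Field Precision).
cur.executescript("""
CREATE TABLE tgt_sales (txn_id INTEGER, amount REAL);
INSERT INTO tgt_sales VALUES (101, 19.99);
INSERT INTO tgt_sales VALUES (101, 19.99);
INSERT INTO tgt_sales VALUES (102, 42.50);
INSERT INTO tgt_sales VALUES (103, 7.00);
""")

# Duplicate Records: any key appearing more than once doubles up sales.
dupes = cur.execute("""
    SELECT txn_id, COUNT(*) FROM tgt_sales
    GROUP BY txn_id
    HAVING COUNT(*) > 1
""").fetchall()
print("duplicate keys:", dupes)  # -> [(101, 2)]

# Numeric Field Precision: compare target amounts against the source of
# truth. The source values are inlined here for brevity; a real test
# would query the source database instead.
source_amounts = {101: 19.99, 102: 42.50, 103: 6.95}
rows = cur.execute("SELECT DISTINCT txn_id, amount FROM tgt_sales")
for txn_id, amount in rows:
    if abs(source_amounts[txn_id] - amount) > 0.005:  # half-cent tolerance
        print(f"precision mismatch on txn {txn_id}: "
              f"source {source_amounts[txn_id]} vs target {amount}")
```

A duplicate-key check like this would also flag the Sequence Generator defect above when run against the sequence-number column instead of the transaction key.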

For more information on QuerySurge or to download a trial, visit www.querysurge.com. QuerySurge is the collaborative Data Testing solution for Big Data that finds bad data and provides a holistic view of your data's health.

© 2014 Real-Time Technology Solutions, Inc. All Rights Reserved. (212) 240-9050 | info@rttsweb.com