
Data Quality Solutions and Bad Data

We all know that C-level executives are making strategic decisions based on information from their BI and analytics initiatives to try to provide their firms with a competitive advantage.

But what if the data is incorrect?

How do you verify the data?


Real-Time Technology Solutions, Inc. - November 5, 2015

We all know that C-level executives are making strategic decisions based on information from their BI and analytics initiatives to try to provide their firms with a competitive advantage. But what if the data is incorrect? Then they are making big bets, impacting the company's direction and future, on analyses built on incorrect or bad data.

I was reading some interesting articles on big data, data warehousing and data quality and came across these interesting statistics:

  • "90% of U.S. companies have some sort of data quality solution in place today" - Experian Data Quality
  • "The average organization loses $8.2 million annually through poor Data Quality." - Gartner
  • "On average, U.S. organizations believe 32% of their data is inaccurate" - Experian Data Quality
  • "46% of companies cite data quality as a barrier for adopting Business Intelligence products" - InformationWeek
  • "Poor data quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits" - Gartner

So why is there a disconnect between the first quote and the next four quotes? If 90% of US companies are implementing some form of Data Quality solution, why are so many companies experiencing bad data issues?

Data Quality vs. Data Testing

In digging deeper, the answer becomes clear when you look at the characteristics of data quality tools. Below are the characteristics from Gartner's 2014 Magic Quadrant for Data Quality Tools:

  • Profiling: analysis of data to capture statistics (metadata); a small SQL sketch follows this list
  • Parsing and standardization: decomposing text fields into components and formatting them based on standards and business rules
  • Generalized "cleansing": modification of data values to meet domain restrictions, integrity constraints or other business rules
  • Matching: identifying, linking or merging related entries within or across sets of data
  • Monitoring: deploying controls to ensure that data continues to conform to business rules
  • Enrichment: enhancing the value of data by appending consumer demographics and geography
  • Subject-area-specific support: standardization capabilities for specific data subject areas
  • Metadata management: the ability to capture, reconcile and correlate metadata related to the quality process
  • Configuration environment: capabilities for creating, managing and deploying data quality rules
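To make the first characteristic concrete, here is a minimal profiling sketch in SQL. The CUSTOMER table and its EMAIL and STATE columns are hypothetical, and function names such as LENGTH vary by database; data quality tools generate this kind of column-level metadata automatically across entire schemas.

    -- Capture basic column statistics (row counts, null counts, distinct values,
    -- value lengths) for two columns of a hypothetical CUSTOMER table.
    SELECT
        COUNT(*)                 AS total_rows,
        COUNT(*) - COUNT(email)  AS email_null_count,
        COUNT(DISTINCT email)    AS email_distinct_values,
        MIN(LENGTH(email))       AS email_min_length,
        MAX(LENGTH(email))       AS email_max_length,
        COUNT(DISTINCT state)    AS state_distinct_values
    FROM customer;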
So while data quality software is incredibly important, none of the above characteristics specifically deals with validating data from source files, databases, XML and other data sources through the transformation process to the target Data Warehouse or Big Data store.

Data testing is completely different. According to the book "Testing the Data Warehouse Practicum" by Doug Vucevic and Wayne Yaddow, the primary goals of data testing are:

  • Data Completeness: verifying that all data has been loaded from the sources to the target DWH
  • Data Transformation: ensuring that all data has been transformed correctly during the Extract-Transform-Load (ETL) process
  • Data Quality: ensuring that the ETL process correctly rejects, substitutes default values, corrects or ignores, and reports invalid data
  • Regression Testing: retesting existing functionality to ensure it remains intact in each new release

Data Testing Methods

Many companies currently perform data testing, data validation and reconciliation, knowing their importance. The problem is that for all of the advances made in big data, data warehouse and database software, the process of data testing is still a manual one that is loaded with risk and ripe for producing massive amounts of bad data. The two most prevalent methods used for data testing are:

  • Sampling (also known as "Stare and Compare") – The tester writes SQL to extract data from the source and from the target data warehouse or big data store, dumps the two result sets into Excel and verifies the data by viewing, or "eyeballing", the results. Since a single test query can return as many as 200 million rows of 200 columns (40 billion data values), and most test teams have hundreds of these tests, this method makes it impossible to validate more than a fraction of 1% of the data and thus cannot be counted on to find data errors.
  • Minus Queries – Using the MINUS method, the tester queries the source data and the target data and subtracts the first result set from the second to determine the difference; if the data matches, no rows remain. The MINUS is then performed again in the opposite direction, subtracting the second result set from the first (a minimal sketch follows this list). This has its value, but the potential issues are (a) the result sets may not be accurate when dealing with duplicate rows, (b) the method does not produce historical data and reports, which is a concern for audit and regulatory reviews, and (c) processing MINUS queries puts pressure on the servers.
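Here is a minimal sketch of the minus-query method described above, assuming hypothetical source and target tables SRC_ORDERS and DWH_ORDERS with identical columns (Oracle uses MINUS; most other databases use EXCEPT):

    -- Rows in the source that are missing from (or different in) the target:
    SELECT order_id, customer_id, order_total FROM src_orders
    MINUS
    SELECT order_id, customer_id, order_total FROM dwh_orders;

    -- Run the comparison in the other direction as well, since MINUS only
    -- reports rows from the first query that are absent from the second:
    SELECT order_id, customer_id, order_total FROM dwh_orders
    MINUS
    SELECT order_id, customer_id, order_total FROM src_orders;

Note that MINUS returns distinct rows, which is one reason duplicate rows can slip through unnoticed (issue (a) above).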
These manual processes are tedious and inefficient, providing limited coverage of data validation, leaving bad data in these data stores and thus allowing bad data to surface in the BI and analytics reports.

Automated Data Testing solutions to the rescue

But there is help out there. A new sector of software vendors has been popping up to fill the need for automated data testing. Led by RTTS' QuerySurge, these testing solutions can provide automated comparisons of upwards of 100% of all data movement quickly, which leads to improved data quality, a reduction in data costs and bad-data risks, shared data health information, and a significant return on investment.

So while data quality tools are an important part of the data solution, data testing complements the data health picture and provides C-level executives and their teams with the confidence that the strategic, potentially game-changing decisions they are making are based on validated, accurate data.

About QuerySurge

QuerySurge is the software division of RTTS. RTTS' team of test experts developed QuerySurge™ to address the unique testing needs in the Big Data and Data Warehousing spaces. QuerySurge is the leading Data Testing solution built specifically to automate the testing of Data Warehouses & Big Data. QuerySurge makes it easy for both novice and experienced team members to validate their organization's data quickly, analyzing and pinpointing up to 100% of all data differences while providing both real-time and historical views of your data's health.

To find the answer to "What is QuerySurge?" click here >
To decide which trial version of QuerySurge fits your needs, click here >
To see recent case studies on QuerySurge, click here >
