Data Quality Testing – A Quick Checklist to
Measure and Improve Data Quality
Did you know?
In the InsideView Alignment Report 2020, more than 70% of revenue leaders
ranked data management as their highest priority, yet a Harvard Business
Review study estimates that only 3 percent of companies’ data meets basic
quality standards.
There is a major gap between what companies want in terms of data quality
and what they are doing to fix it.
The first step to any data management plan is to test the quality of data and
identify the core issues that lead to poor data quality. Here’s a quick
checklist to help IT managers, business managers, and decision-makers
analyze the quality of their data, along with the tools and frameworks that
can help make it accurate and reliable.
What is data quality and why does it
matter?
Before we delve into the checklist, here’s a quick briefing on what data quality
is and why it matters.
There is no single definition of data quality, and to give one would be to limit
the scope of data itself. There are, however, benchmarks that can be used to
assess the state of your data. For instance, high-quality data would mean:
● It’s error-free. No typos, no format and structure issues.
● It’s consolidated. Data is not scattered over different systems.
● It’s unique. It is not duplicated.
● It’s timely. The data is not obsolete.
● It’s accurate. You can rely on this data to make business decisions.
It’s not mandatory (but is helpful) for your data to be all of this. Data quality
matters because:
● Your business loses money for every inaccurate data field
● Your direct mail and marketing campaigns incur unnecessary costs for
every wrong address field
● You’re making business decisions based on flawed data
● You’re receiving inaccurate insights
● Your data is obsolete and does not fulfill its intended purpose.
Put simply, poor data, left neglected, impacts every aspect of your business
process – from sales and marketing to customer support and team efficiency.
In recent years, data quality has stopped being a backburner process. It
affects businesses drastically, which makes it all the more important to treat
data quality as a burning issue that needs resolution before it endangers a
business’s growth plan.
Prerequisites of data quality testing
Before you can test your data efficiently, you need to define and set the right
expectations for the process and for the data itself. Let’s look at what you
should know before starting your data quality testing process.
1. Purpose of your data
What do you want to achieve with your data?
Is it supposed to fuel your business intelligence process? Or help you
identify new market opportunities and customer segments? Whatever
the intended purpose of data is at your company, identify it. If you don’t
understand what data can do for you, you’ll never be able to measure
whether it is fulfilling its purpose.
2. Data quality metrics
What does high-quality data mean to you?
You must understand the metrics that will help you to measure data
quality. This could be as simple as the six critical data quality
dimensions that we all know so well. But it is better if you make this a bit
more specific to your use case. For example, the Date column in a
dataset should contain correctly formatted dates only, but a
well-formatted date can still be a garbage value if it is too old to be
plausible. So you could have your own, more specific definition of what
accurate, complete, consistent, valid, timely, and unique mean to your
company.
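To make a rule like this testable, you can encode it directly. Here’s a
minimal sketch in Python with pandas; the column name and cutoff year are
illustrative assumptions, not fixed standards:

```python
import pandas as pd

def check_date_quality(df, column="signup_date", earliest_year=1990):
    """Flag values that are missing, unparseable, or implausibly old."""
    parsed = pd.to_datetime(df[column], errors="coerce")
    return {
        "missing": int(df[column].isna().sum()),
        "unparseable": int((df[column].notna() & parsed.isna()).sum()),
        "implausibly_old": int((parsed.dt.year < earliest_year).sum()),
        "in_future": int((parsed > pd.Timestamp.now()).sum()),
    }

df = pd.DataFrame({"signup_date": ["2021-03-01", "1888-01-01", "not a date", None]})
print(check_date_quality(df))
# {'missing': 1, 'unparseable': 1, 'implausibly_old': 1, 'in_future': 0}
```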
3. Metadata of data fields
What is the correct definition and structure of each data attribute
in your dataset?
This is probably the most important information that you need prior to
your data quality testing process. Metadata is the information that
describes your data. It helps you to understand the descriptive and
structural definition of each data field in your dataset, and hence
measure its impact and quality.
Examples of metadata include the data’s creation date and time, its purpose,
its source, the process used to create it, the creator’s name, and so on.
Metadata lets you define why a data field is captured in your dataset, its
purpose, its acceptable value range, and the appropriate channel and time for
its creation, and then use those definitions while testing and measuring data
quality.
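One lightweight way to make this metadata usable in testing is to record it as
a machine-readable schema that your quality checks can read. A sketch, with
hypothetical fields and constraints:

```python
# Hypothetical field-level metadata kept alongside the dataset, so quality
# tests read constraints from one place instead of hard-coding them.
field_metadata = {
    "age": {
        "description": "Customer age in years at signup",
        "source": "signup form",
        "type": "integer",
        "min": 0,
        "max": 120,
        "required": True,
    },
    "email": {
        "description": "Primary contact email address",
        "source": "signup form",
        "type": "string",
        "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
        "required": True,
    },
}
```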
How do you check the quality of your data?
Now here’s the part that you’ve been waiting for. Once you’ve prepared and
set the broad testing criteria, you are now ready to begin your testing process.
There are multiple levels of data quality testing depending on the depth and
perspective of the test plan you’re following.
LEVEL 1: Quick fact-checking of data values
Since data is captured from our surroundings, we can quickly validate its
accuracy by comparing it with known truths. For example: does the Age column
contain negative values? Are required Name fields set to null? Do Address
field values represent real addresses? Does the Date column contain correctly
formatted dates? And so on.
This level of testing can be performed by generating a quick data profile of
your dataset. It is a simple compare-and-label test: your dataset values are
compared against your defined validations and some known correct values, and
classified as valid or invalid. Although it can be done manually, you can also
use an automated tool that runs a quick profile test and shows you where your
data stands against the defined validation rules.
But keep in mind that this level only tests the data itself, not the metadata.
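As a concrete illustration, a level-1 profile can be a handful of boolean
checks aggregated into counts. A minimal sketch in Python with pandas; the
columns and rules are assumptions for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", None, "Bo"],
    "age": [34, -5, 29],
    "date": ["2023-01-15", "2023-02-30", "2023-03-01"],
})

profile = {
    "negative_age": int((df["age"] < 0).sum()),
    "null_name": int(df["name"].isna().sum()),
    "bad_date": int(pd.to_datetime(df["date"], errors="coerce").isna().sum()),
}
print(profile)  # {'negative_age': 1, 'null_name': 1, 'bad_date': 1}
```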
LEVEL 2: Holistic analysis of the dataset
Level-1 testing focuses on validating each individual value present in the
dataset. The next level requires you to consider and test your dataset more
holistically, meaning both vertically and horizontally. This level of testing
is especially useful when implemented at the data-entry level, as it stops
errors from cascading into your dataset.
1. Vertical testing
It means computing the statistical distribution of each data attribute and
validating that all values follow that distribution. This allows you to
continuously verify that new, incoming data has the same nature as the data
already residing in your dataset.
Furthermore, for this type of testing, you can determine the median and
average values for each distribution, and set minimum and maximum
thresholds. On every new entry to the dataset, you can check the
probability that the new data belongs to this distribution. If the probability
is high enough (approx. 95% or more), you can conclude that the data is
valid and accurate.
You can also use the metadata of an attribute to compute distribution
and test incoming data against it. For example, the Name field usually
contains 7–15 characters. If a new Name entry has only 2 characters, it can
be flagged as a potential error, since this metadata value (the length) does
not conform to the expected distribution.
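A simple approximation of this check models each attribute by its historical
mean and standard deviation and flags values that fall far outside. A sketch
(the ±3 sigma threshold and the normality assumption are illustrative; the
name-length example mirrors the one above):

```python
import statistics

def fit_distribution(values):
    """Summarize an attribute's historical distribution."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}

def is_plausible(value, dist, max_sigma=3.0):
    """Flag values more than max_sigma standard deviations from the mean."""
    if dist["stdev"] == 0:
        return value == dist["mean"]
    return abs(value - dist["mean"]) / dist["stdev"] <= max_sigma

# Historical name lengths, then two new entries to test.
dist = fit_distribution([7, 9, 12, 8, 11, 10, 15, 9, 13, 8])
print(is_plausible(len("Bo"), dist))         # False -> potential error
print(is_plausible(len("Charlotte"), dist))  # True
```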
2. Horizontal testing
It means performing a holistic analysis to qualify the uniqueness of each
record in your dataset. For this type of testing, you go row by row through
the dataset and verify that all records represent uniquely identifiable
entities and that no duplicates are present. This is a more complex form of
testing, as it can be difficult to assess the uniqueness of a record in the
absence of a unique key. For this purpose, advanced algorithms apply fuzzy
matching techniques to determine probabilistic matches.
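As a toy illustration of fuzzy duplicate detection without a unique key, you
can score record pairs with a string-similarity ratio, here using Python’s
standard-library difflib. The fields and the 0.85 threshold are assumptions,
and production tools use far more sophisticated matching:

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Boston"},
    {"id": 2, "name": "Jon Smith", "city": "Boston"},
    {"id": 3, "name": "Maria Garcia", "city": "Austin"},
]

def similarity(a, b):
    """Crude record similarity: average of per-field string ratios."""
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in ("name", "city")]
    return sum(scores) / len(scores)

for a, b in combinations(records, 2):
    score = similarity(a, b)
    if score >= 0.85:
        print(f"Probable duplicate: records {a['id']} and {b['id']} (score {score:.2f})")
```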
LEVEL 3: Historical analysis of the dataset
Level 3 testing is the same as level 2, but instead of considering only the
current dataset, historical records are also used when computing row matches
and field distributions. This ensures that changes in the data over time are
also considered while validating data values.
For example, yearly sales are expected to spike at the end of the year due to
holidays and are comparatively slower in the seasons leading up to it. So, you
can end up drawing incorrect conclusions about your data if you don’t take
time into consideration. With this level, you can also run tests for detecting
anomalies in your data. This is done by looking at the history of values in a
data attribute and classifying current values as normal or abnormal.
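A minimal way to bring history into the check is to compare each new value
against values from the same period in past years rather than against the
whole dataset. A sketch with invented monthly sales figures:

```python
import statistics

# Hypothetical sales history: {month: [values from past years]}.
history = {
    11: [90, 95, 88],     # November
    12: [150, 160, 155],  # December spikes due to holidays
}

def is_seasonal_anomaly(month, value, max_sigma=3.0):
    """Compare a new value against the same month in prior years."""
    past = history[month]
    mean, stdev = statistics.mean(past), statistics.stdev(past)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > max_sigma

print(is_seasonal_anomaly(12, 158))  # False: normal for December
print(is_seasonal_anomaly(11, 158))  # True: abnormal for November
```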
Using data quality testing tools and
frameworks
Now that we’ve covered the different levels of data quality testing, let’s look at
the tools and frameworks available out there that can help you implement your
testing process.
1. Manual QA/testing
In traditional data warehouse environments, a data quality test is a
manual verification process. Users manually verify values for data
types, character lengths, formats, and whether each value falls within
an acceptable range. This manual verification not only makes the
process time-intensive but also leaves the testing results prone to
human error.
2. Open-source libraries
A number of open-source projects are available that can help you test
your data using various coded functions. Many organizations find these
solutions easy to adopt, but some require customization before the
tools fit their use cases. As these tools only offer the code for
functional scripts, you may need a developer to complete the process of
reporting test results or programming custom alerts every time a data
quality rule is violated.
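For a flavor of what these libraries look like, here is a sketch using Great
Expectations, one well-known open-source option, shown with its legacy Pandas
API; method names and result shapes vary by version, so treat this as
illustrative rather than definitive:

```python
import great_expectations as ge
import pandas as pd

# Wrap a DataFrame so column-level expectations can be run against it.
df = ge.from_pandas(pd.DataFrame({"name": ["Alice", None], "age": [34, 29]}))

# Each expectation returns a result whose success flag you can report on.
print(df.expect_column_values_to_not_be_null("name").success)   # False
print(df.expect_column_values_to_be_between("age", min_value=0, max_value=120).success)  # True
```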
3. Coded solutions built in-house
It is very common for companies to build a custom solution for any
problem they face, and data quality testing is no different. Management
either outsources the project or tasks a team of in-house developers
with understanding the company’s data quality control issues, and
invests in implementing a custom solution. Although the idea of a data
quality control system built specifically for your organization’s use
case seems attractive, it is usually very difficult to maintain the
validity of such code scripts, as data quality definitions constantly
need review and change.
4. Automated self-service tools
As data quality challenges become more complex, modern problems
require modern solutions. Data scientists and data analysts spend 80%
of their time testing data quality and only 20% extracting business
insights. Automated data quality testing tools leverage advanced
algorithms to free you from the manual labor of testing datasets for
quality, and from maintaining coded solutions over time as data quality
definitions evolve.
These tools are designed to be self-service and user-friendly so that
anyone – business users, data analysts, IT managers – can generate
quick data profiles as well as perform in-depth analysis of data quality
through proprietary data matching techniques.
Normally, these tools specialize in offering two different types of testing
engines – some come with only one, and very few offer both. Let’s take
a look at them.
1) Rules-based engines
Rules-based testing tools allow you to configure rules for validating
datasets against your custom-defined data quality requirements. You
can define rules for different dimensions of a data field. For example, its
length, allowed formats and data types, acceptable range values,
required patterns, and so on. These tools quickly profile your data
against configured rules, and offer a concise data quality summary
report which covers the results of the test.
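Under the hood, a rules-based engine amounts to applying declarative
per-field rules to every record and collecting the violations. A minimal
sketch; the rules and fields are invented for the example:

```python
import re

rules = {
    "age":   {"type": int, "min": 0, "max": 120},
    "email": {"type": str, "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
}

def validate(record):
    """Return the list of rule violations for one record."""
    errors = []
    for field, rule in rules.items():
        value = record.get(field)
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: wrong type")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below minimum")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above maximum")
        if "pattern" in rule and not re.match(rule["pattern"], value):
            errors.append(f"{field}: bad format")
    return errors

print(validate({"age": -3, "email": "alice@example.com"}))  # ['age: below minimum']
print(validate({"age": 34, "email": "not-an-email"}))       # ['email: bad format']
```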
2) Suggestion-based engines
Suggestion-based testing tools are usually based on machine learning
algorithms. They analyze your current and historical datasets to train
models of the data’s distribution, then test every incoming data value
against the model and output a data quality suggestion based on the
result. Instead of requiring you to configure data quality rules
manually, suggestion-based tools suggest how qualified your data is.
This is a very efficient way of analyzing and capturing anomalies at the
data-entry level.
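To make the idea concrete, here is one way such an engine could be
approximated with an off-the-shelf anomaly detector, scikit-learn’s
IsolationForest; the training features are invented, and commercial
suggestion engines use their own proprietary models:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Historical records as [name_length, age] pairs drawn from trusted past data.
rng = np.random.default_rng(0)
history = np.column_stack([
    rng.integers(7, 16, size=500),   # typical name lengths
    rng.integers(18, 80, size=500),  # typical ages
])

model = IsolationForest(contamination=0.01, random_state=0).fit(history)

# predict() returns +1 for values resembling the history, -1 for anomalies.
new_entries = np.array([[10, 35], [2, 250]])
print(model.predict(new_entries))  # e.g. [ 1 -1]
```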
Next course of action: Quality maintenance
Data quality testing is not a static, one-time process. Right when you feel like
you’ve got the quality of your dataset under control, invest in implementing a
long-term plan for quality maintenance. There are different activities that need
to be performed at regular intervals to ensure that the quality achieved is
being maintained. Some of them include:
1. Employ data quality control for data
integration
As new data enters your ecosystem, the overall quality of your data
deteriorates. This is why you need to implement data quality checks at
the data entry or data integration level: you want to make sure that new
data introduced into the system is accurate and unique, and is not a
duplicate of any entity currently residing in your master record.
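Such a gate can be as simple as refusing, or routing to review, any incoming
record that matches an existing master record above a similarity threshold.
A sketch reusing the fuzzy-similarity idea from the horizontal-testing
example; the fields and threshold are assumptions:

```python
from difflib import SequenceMatcher

master = [{"name": "Jonathan Smith", "city": "Boston"}]

def looks_like(a, b, threshold=0.85):
    """Average per-field string similarity against a master record."""
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in ("name", "city")]
    return sum(scores) / len(scores) >= threshold

def ingest(record):
    """Accept a record only if it is not a probable duplicate."""
    if any(looks_like(record, existing) for existing in master):
        return False  # route to a review queue instead of inserting
    master.append(record)
    return True

print(ingest({"name": "Jon Smith", "city": "Boston"}))     # False: probable duplicate
print(ingest({"name": "Maria Garcia", "city": "Austin"}))  # True: new entity
```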
2. Profile your data at regular intervals
This is probably one of the most important post-testing activities. You
need to continuously assess the state of your data, which means running
quick profile tests on your dataset at regular intervals so that errors
are resolved on time. It is good practice to store the results of these
profiles over time, as they help you understand at what point in time
your data quality went south.
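Storing those results can be as lightweight as appending a timestamped
snapshot of your key metrics to a log file that you can diff over time. A
sketch; the metric names and file path are illustrative:

```python
import json
from datetime import datetime, timezone

def snapshot_profile(profile, path="quality_profiles.jsonl"):
    """Append a timestamped data quality profile to a JSON-lines log."""
    entry = {"taken_at": datetime.now(timezone.utc).isoformat(), **profile}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

snapshot_profile({"null_name": 12, "bad_date": 3, "duplicates": 7})
```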
3. Fix root cause of identified errors
Keep an eye on the kinds of errors your data profile reports usually
contain. Does your data mostly alert you to incorrect date formats? Are
there null values in required fields? Maybe you need to fix your data
entry form validations. This activity will help you eliminate data
quality errors at the root and allow you to leverage data directly for
its intended purpose.
To conclude – test data quality before it
gets too late
Most companies don’t engage in data quality testing until it becomes
critical for a data migration or a merger, and by then it is often too
late to undo the damage caused by poor data. Test your data quality,
define the criteria, and set benchmarks to drive improvement.
Want to test your data quality? Give
DME a try!
Luckily, you no longer have to put in the effort of manually testing
your data, as most ML-based data quality testing solutions today let
businesses do it in a few easy steps. You’re choosing between 2 minutes
and 12 hours, and the choice doesn’t have to be daunting.
Best-in-class solutions like DataMatch Enterprise offer free trials that
you can benefit from. All you have to do is plug in your data source and
let the software guide you through the process. You’ll be surprised at
the hours of manual effort you save your team with an automated solution
that also delivers more accurate results than manual methods.
  • 12. your data quality errors at the root and will allow you to leverage data directly for its intended purpose. To conclude – test data quality before it gets too late Most companies don’t engage in data quality tests unless critical for data migration or a merger, but at that time, it’s way too late to salvage the problems caused by poor data. Test your data quality, define the criteria, and set benchmarks to drive improvement. Want to test your data quality? Give DME a try! Luckily, you no longer have to put in the effort of manually testing your data as most ML-based data quality testing solutions today allow businesses to do that with a few easy steps. You’re choosing between 2 minutes vs 12 hours. And the choice doesn’t have to be daunting. Best-in-class solutions like DataMatch Enterprise allow free trials that you can benefit from. All you have to do is plug in your data source and let the software guide you through the process. You’ll be surprised at the hours and manual effort you’d be saving your team with an automated solution that also delivers more accurate results than manual methods.