Don't wait for a data migration event to test your data quality. Perform data quality tests now, before it's too late. Here's everything you need to know!
Data Quality Testing – A Quick Checklist to Measure and Improve Data Quality
Did you know?
More than 70% of revenue leaders in the InsideView Alignment Report 2020 rank data management as the highest priority, yet a Harvard Business Review study estimates that only 3 percent of companies' data meets basic quality standards.
There is a major gap between what companies want in terms of data quality
and what they are doing to fix it.
The first step in any data management plan is to test the quality of your data and identify the core issues that lead to poor data quality. Here's a quick checklist to help IT managers, business managers, and decision-makers analyze the quality of their data and find the tools and frameworks that can help make it accurate and reliable.
What is data quality and why does it matter?
Before we delve into the checklist, here’s a quick briefing on what data quality
is and why it matters.
There is no single definition of data quality, and to give one would be to limit the scope of data itself. There are, however, benchmarks that can be used to assess the state of your data. For instance, data of high quality would mean:
● It’s error-free. No typos, no format or structure issues.
● It’s consolidated. Data is not scattered over different systems.
● It’s unique. It is not duplicated.
● It’s timely. The data is not obsolete.
● It’s accurate. You can rely on this data to make business decisions.
It’s not mandatory (but it is helpful) for your data to meet all of these criteria. Data quality matters because:
● Your business is losing money for every inaccurate data field
● Your direct mail & marketing campaigns incur unnecessary costs for
every wrong address data field
● You’re making business decisions based on flawed data
● You’re receiving inaccurate insights
● Your data is obsolete and does not fulfill its intended purpose.
Put simply, poor data, left neglected, impacts every aspect of your business process – from sales and marketing to customer support, customer service, and team efficiency. Data quality is no longer a back-burner process. It affects businesses drastically, which makes it all the more important to treat data quality as a burning issue that needs resolution before it endangers a business's growth plan.
Prerequisites of data quality testing
Before you can test your data efficiently, you need to define and set the right expectations of the process and of the data itself. Let's look at what you should know before starting your data quality testing process.
1. Purpose of your data
What do you want to achieve with your data?
Is it supposed to fuel your business intelligence process? Or help you
identify new market opportunities and customer segments? Whatever
the intended purpose of data is at your company, identify it. If you don’t
understand what data can do for you, you’ll never be able to measure
whether it is fulfilling its purpose.
2. Data quality metrics
What does high-quality data mean to you?
You must understand the metrics that will help you measure data quality. These could be as simple as the six critical data quality dimensions that we all know so well, but it is better to make them more specific to your use case. For example, the Date column in a dataset should contain correctly formatted dates only, yet a correctly formatted date can still be a garbage value if it is too old to be plausible. So you could have your own, more specific definition of what accurate, complete, consistent, valid, timely, and unique mean to your company.
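To make this concrete, here is a minimal Python sketch of one such use-case-specific rule; the column name, date format, and the "no dates before 1900" cutoff are illustrative assumptions, not fixed standards.

```python
from datetime import datetime

def is_valid_signup_date(value, fmt="%Y-%m-%d", earliest_year=1900):
    """Return True if the value parses in the expected format
    and falls within a plausible range for this use case."""
    try:
        parsed = datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        return False  # wrong format, or not a string at all
    # A correctly formatted date can still be garbage if it is implausibly old
    return earliest_year <= parsed.year <= datetime.now().year

print(is_valid_signup_date("2021-05-14"))  # True
print(is_valid_signup_date("14/05/2021"))  # False: wrong format
print(is_valid_signup_date("1802-01-01"))  # False: implausibly old
```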
3. Metadata of data fields
What is the correct definition and structure of each data attribute
in your dataset?
This is probably the most important information that you need prior to
your data quality testing process. Metadata is the information that
describes your data. It helps you to understand the descriptive and
structural definition of each data field in your dataset, and hence
measure its impact and quality.
Examples of metadata include the data's creation date and time, the purpose of the data, its source, the process used to create it, the creator's name, and so on. Metadata allows you to define why a data field is being captured in your dataset, its purpose, acceptable value range, appropriate channel and time for creation, and so on, and to use that information while testing and measuring data for quality.
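As a rough illustration, a metadata record for a single field might look like the following sketch; the field names and values are assumptions for the example.

```python
# Illustrative metadata record for one data field (names and values are assumptions)
age_field_metadata = {
    "name": "Age",
    "description": "Customer age in years at the time of signup",
    "data_type": "integer",
    "source": "web signup form",
    "created_by": "CRM intake process",
    "acceptable_range": (18, 120),  # values outside this range fail quality tests
    "required": True,
}
```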
How do you check the quality of your data?
Now here’s the part that you’ve been waiting for. Once you’ve prepared and
set the broad testing criteria, you are now ready to begin your testing process.
There are multiple levels of data quality testing depending on the depth and
perspective of the test plan you’re following.
LEVEL 1: Quick fact-checking of data values
Since data is captured from our surroundings, we can quickly validate its accuracy by comparing it with known truths. For example: does the Age column contain any negative values; are required Name fields set to null; do Address field values represent real addresses; does the Date column contain correctly formatted dates; and so on.
This level of testing can be performed by generating a quick data profile of your dataset. It is a simple compare-and-label test where your dataset values are compared against your defined validations and some known, correct values, and classified as valid or invalid. Although it can be done manually, you can also use an automated tool that will run a quick profile test and show you where your data stands against the defined validation rules. But keep in mind that this level only tests the data itself, not the metadata.
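Here is a minimal sketch of such a compare-and-label profile using pandas; the column names, sample values, and validation rules are assumptions for illustration.

```python
import pandas as pd

# Illustrative dataset with a few deliberately bad values
df = pd.DataFrame({
    "Name": ["Alice", None, "Bob"],
    "Age": [34, -2, 51],
    "Date": ["2023-01-15", "2023-02-30", "not a date"],
})

# Compare values against simple validation rules and count violations
profile = {
    "Age contains negative values": int((df["Age"] < 0).sum()),
    "Required Name fields set to null": int(df["Name"].isna().sum()),
    "Date values not correctly formatted": int(pd.to_datetime(df["Date"], errors="coerce").isna().sum()),
}

for rule, violations in profile.items():
    print(f"{rule}: {violations} violation(s)")
```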
LEVEL 2: Holistic analysis of the dataset
Level-1 testing focuses on validating each individual value present in the dataset. The next level requires you to consider and test your dataset more holistically, which means testing it vertically as well as horizontally. This level of testing is very useful if implemented at the data-entry level, as it stops errors from cascading into your dataset.
1. Vertical testing
This means computing the statistical distribution of each data attribute and validating that all values follow that distribution. It allows you to continuously check that the nature of new, incoming data is the same as that of the data already residing in your dataset.
Furthermore, for this type of testing, you can determine the median and
average values for each distribution, and set minimum and maximum
thresholds. On every new entry to the dataset, you can check the
probability that the new data belongs to this distribution. If the probability
is high enough (approx. 95% or more), you can conclude that the data is
valid and accurate.
You can also use the metadata of an attribute to compute a distribution and test incoming data against it. For example, the Name field usually contains 7 to 15 characters. If a new Name entry has only 2 characters, it can be treated as a potential error, since its length does not conform to the expected distribution.
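A minimal sketch of this kind of distribution check, using a simple mean-and-standard-deviation threshold; the sample values and the two-standard-deviation cutoff (roughly the 95% band for normally distributed data) are assumptions for illustration.

```python
import statistics

# Existing Age values in the dataset (illustrative)
existing_ages = [31, 45, 29, 52, 38, 41, 36, 48, 33, 44]

mean = statistics.mean(existing_ages)
stdev = statistics.stdev(existing_ages)

def fits_distribution(new_value, mean, stdev, z_threshold=2.0):
    """Flag values more than ~2 standard deviations from the mean
    (roughly the 95% band for a normal distribution)."""
    return abs(new_value - mean) <= z_threshold * stdev

print(fits_distribution(47, mean, stdev))   # True: consistent with existing data
print(fits_distribution(150, mean, stdev))  # False: likely a data-entry error
```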
2. Horizontal testing
This means performing a holistic analysis to assess the uniqueness of each record in your dataset. For this type of testing, you need to go row by row through the dataset and verify that all records represent uniquely identifiable entities and that there are no duplicates present. This is a more complex form of testing, as it can be difficult to assess the uniqueness of a record in the absence of a unique key. For this purpose, advanced algorithms are used to perform fuzzy matching and determine probabilistic matches.
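Here is a minimal sketch of fuzzy duplicate detection using Python's standard-library difflib; commercial tools rely on far more sophisticated probabilistic matching, and the records, weights, and threshold here are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Boston"},
    {"id": 2, "name": "Jon Smith", "city": "Boston"},
    {"id": 3, "name": "Maria Garcia", "city": "Austin"},
]

def similarity(a, b):
    """Simple fuzzy score combining name and city similarity."""
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_score = SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()
    return 0.7 * name_score + 0.3 * city_score

# Compare every pair of records and flag probable duplicates
for left, right in combinations(records, 2):
    score = similarity(left, right)
    if score > 0.8:  # illustrative threshold for a probable duplicate
        print(f"Records {left['id']} and {right['id']} look like duplicates (score={score:.2f})")
```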
LEVEL 3: Historical analysis of the dataset
Level-3 testing is the same as level 2, but instead of considering only the current dataset, historical records are also used to compute row matches and field distributions. This is done so that any changes in data that happen over time are also considered while validating data values.
For example, yearly sales are expected to spike at the end of the year due to
holidays and are comparatively slower in the seasons leading up to it. So, you
can end up drawing incorrect conclusions about your data if you don’t take
time into consideration. With this level, you can also run tests for detecting
anomalies in your data. This is done by looking at the history of values in a
data attribute and classifying current values as normal or abnormal.
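A minimal sketch of a seasonality-aware check along these lines: a new December figure is compared against previous Decembers rather than against the overall average. The figures and tolerance are illustrative assumptions.

```python
import statistics

# Monthly sales keyed by (year, month); illustrative figures
historical_sales = {
    (2020, 12): 118_000, (2021, 12): 124_000, (2022, 12): 131_000,
    (2020, 6): 64_000,   (2021, 6): 67_000,   (2022, 6): 70_000,
}

def is_anomalous(value, history, month, tolerance=0.25):
    """Compare a new monthly figure against the same month in previous years."""
    same_month = [v for (y, m), v in history.items() if m == month]
    baseline = statistics.mean(same_month)
    return abs(value - baseline) / baseline > tolerance

print(is_anomalous(128_000, historical_sales, month=12))  # False: in line with past Decembers
print(is_anomalous(60_000, historical_sales, month=12))   # True: suspiciously low for December
```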
Using data quality testing tools and frameworks
Now that we've covered the different levels of data quality testing, let's look at the tools and frameworks available that can help you implement your testing process.
1. Manual QA/testing
In traditional data warehouse environments, a data quality test is a
manual verification process. Users manually verify values for data
types, length of characters, formats, and whether the value falls within
an acceptable range. This manual verification not only makes the process time-intensive but also leaves the testing results prone to human error.
2. Open-source libraries
A number of open-source projects are available that can help you to test
your data using various coded functions. Many organizations find these
solutions easily adaptable, but some do require customizations to be
done before they can leverage these tools for their use cases. As these
tools only offer the code for functional scripts, you may need to a
developer to complete the process of reporting test results, or
programming custom alerts every time a data quality rule is violated.
3. Coded solutions built in-house
It is very common for companies to decide on building a custom solution
for any problem that they are facing. And it is no different for data quality
testing. Management either outsources the project or tasks a team of in-house developers with understanding the data quality control issues, and invests in the implementation of a custom solution. Although the idea of having a data quality control system built specifically for your organization's use case seems attractive, it is usually very difficult to maintain the validity of such code scripts, as data quality definitions constantly need review and change.
4. Automated self-service tools
As data quality challenges become more complex, modern problems
require modern solutions. Data scientists and data analysts are
spending 80% of their time testing data quality and only 20% extracting business insights. Automated data quality testing tools leverage advanced algorithms to free you from the manual labor of testing datasets for quality, or of maintaining coded solutions over time as data quality definitions evolve.
These tools are designed to be self-service and user-friendly so that
anyone – business users, data analysts, IT managers – can generate
quick data profiles as well as perform in-depth analysis of data quality
through proprietary data matching techniques.
Normally, these tools offer one of two different types of testing engines – some come with only one, and very few specialize in both. Let's take a look at them.
1) Rules-based engines
Rules-based testing tools allow you to configure rules for validating
datasets against your custom-defined data quality requirements. You
can define rules for different dimensions of a data field. For example, its
length, allowed formats and data types, acceptable range values,
required patterns, and so on. These tools quickly profile your data
against the configured rules and offer a concise data quality summary report that covers the results of the test.
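As a rough illustration, declarative rules of this kind might be configured and applied as in the sketch below; the rule keys, fields, and patterns are assumptions, not any particular vendor's configuration format.

```python
import re

# Illustrative rule configuration: one entry per field
rules = {
    "email": {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$", "max_length": 254},
    "age":   {"min": 0, "max": 120},
}

def check_record(record):
    """Return a list of rule violations for one record."""
    failures = []
    for field, rule in rules.items():
        value = record.get(field)
        if "pattern" in rule and not re.match(rule["pattern"], str(value)):
            failures.append(f"{field}: does not match required pattern")
        if "max_length" in rule and len(str(value)) > rule["max_length"]:
            failures.append(f"{field}: exceeds maximum length")
        if "min" in rule and value < rule["min"]:
            failures.append(f"{field}: below acceptable range")
        if "max" in rule and value > rule["max"]:
            failures.append(f"{field}: above acceptable range")
    return failures

print(check_record({"email": "jane@example.com", "age": 34}))  # []
print(check_record({"email": "not-an-email", "age": 150}))     # two failures
```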
2) Suggestion-based engines
Suggestion-based testing tools are usually based on machine learning algorithms. They analyze your current and historical datasets to train models of the data's distribution. Next, they test every incoming data value against the model and output a data quality suggestion based on the result. Instead of requiring you to manually configure data quality rules, suggestion-based tools tell you how qualified your data is. This is a very efficient way of analyzing and capturing anomalies at the data-entry level.
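A minimal sketch of this idea using scikit-learn's IsolationForest as the learned model; the features, figures, and contamination setting are assumptions, and production suggestion engines use much richer models.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Historical, trusted records: [age, order_amount]; illustrative figures
historical = np.array([[34, 120.0], [29, 80.0], [45, 150.0], [51, 95.0],
                       [38, 110.0], [41, 130.0], [33, 105.0], [47, 140.0]])

# Learn the shape of "normal" data from history
model = IsolationForest(contamination=0.1, random_state=42).fit(historical)

# Score new incoming records against the learned distribution
incoming = np.array([[36, 115.0],   # looks in line with history
                     [240, 5.0]])   # implausible age, tiny amount
for row, label in zip(incoming, model.predict(incoming)):
    verdict = "looks valid" if label == 1 else "flag for review"
    print(row, "->", verdict)
```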
Next course of action: Quality maintenance
Data quality testing is not a static, one-time process. Right when you feel like
you’ve got the quality of your dataset under control, invest in implementing a
long-term plan for quality maintenance. There are different activities that need
to be performed at regular intervals to ensure that the quality achieved is
being maintained. Some of them include:
1. Employ data quality control for data integration
As new data enters your ecosystem, the overall quality of your data deteriorates. This is why you need to implement data quality checks at the data-entry or data-integration level. You want to make sure that new data introduced into the system is accurate, unique, and not a duplicate of any entity currently residing in your master record.
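A minimal sketch of such a gate at the integration point: a record is validated against the master record for near-duplicates before it is accepted. The master records and matching threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Illustrative master records already in the system
master_records = [
    "Acme Corporation, 12 Main St, Boston",
    "Globex Inc, 45 Oak Ave, Austin",
]

def accept_record(new_record, master, threshold=0.85):
    """Reject incoming records that are near-duplicates of existing master records."""
    for existing in master:
        score = SequenceMatcher(None, new_record.lower(), existing.lower()).ratio()
        if score >= threshold:
            return False, f"near-duplicate of existing record: {existing!r}"
    return True, "accepted"

# Same company with the street name spelled out: rejected as a duplicate
print(accept_record("Acme Corporation, 12 Main Street, Boston", master_records))
# Genuinely new entity: accepted
print(accept_record("Initech LLC, 9 Pine Rd, Denver", master_records))
```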
2. Profile your data at regular intervals
This is probably one of the most important post-testing activities. You
need to continuously assess the state of your data. This requires you to run quick profile tests on your dataset at regular intervals to ensure errors are resolved on time. It is good practice to store the results of these profiles over time, as they will help you understand at what point your data quality went south.
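As a rough sketch, storing dated profile summaries can be as simple as appending a row to a log; the metric, record structure, and file name are illustrative assumptions.

```python
import csv
from datetime import date

def log_profile(dataset, path="quality_profile_log.csv"):
    """Append today's profile summary so quality trends can be reviewed later."""
    total = len(dataset)
    missing_emails = sum(1 for row in dataset if not row.get("email"))
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), total, missing_emails])

# Illustrative run on a tiny dataset of dictionaries
log_profile([
    {"email": "a@example.com"},
    {"email": None},
    {"email": "b@example.com"},
])
```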
3. Fix root cause of identified errors
Keep an eye on the kinds of errors your data profile reports usually contain. Does your data mostly alert you about incorrect date formats? Are there null values in required fields? Maybe you need to fix your data entry form validations. This activity will help you eliminate your data quality errors at the root and will allow you to leverage data directly for its intended purpose.
To conclude – test data quality before it gets too late
Most companies don’t engage in data quality testing until it becomes critical for a data migration or a merger, but by then it is far too late to undo the damage caused by poor data. Test your data quality, define the criteria, and set benchmarks to drive improvement.
Want to test your data quality? Give DME a try!
Luckily, you no longer have to put in the effort of manually testing your data, as most ML-based data quality testing solutions today allow businesses to do that in a few easy steps. You're choosing between 2 minutes and 12 hours, and the choice doesn't have to be daunting. Best-in-class solutions like DataMatch Enterprise offer free trials that you can benefit from. All you have to do is plug in your data source and let the software guide you through the process. You'll be surprised how many hours of manual effort you save your team with an automated solution that also delivers more accurate results than manual methods.