A McKeel Research LLC White Paper
455 Newport Way, Suite 103
Issaquah, Washington 98027
(425) 996-0427

Data Quality Process Design for Ad-Hoc Reporting

By Jim Atwater, Principal Consultant
Management Analytics Practice
September 2008
Contents

Introduction
Problem Statement
Previous Options
Our Solution
Implementation
Summary

Introduction

This white paper provides an overview of some of the key objects contained within a baseline data-cleansing subsystem for use by ad-hoc reporting solutions, be they relational, dimensional, or somewhere in between. The key scenario is based on experience in enterprise sales and marketing work groups responsible for metrics and analytics.
Problem Statement

Business organizations have come to realize the value of dimensional data modeling. This is particularly the case when it comes to the "one version of the truth" level of rigor such systems bring to issues of data quality. Unfortunately, the complexity inherent in a proper data warehouse implementation puts such tactics outside the reach of many sales and marketing workgroups, even in large enterprise organizations. Barriers include a lack of skilled resources, the time and commitment required in the analysis phase, and the expense compared to relationally based legacy ad-hoc reporting solutions.
Previous Options

Legacy relational approaches typically build reporting solutions directly on source-system data. Data cleansing and auditing are typically compiled after the fact by analysts as footnotes to the reports. This practice wastes time, causes errors, and leaves a rich source of analytical information untapped. As such workgroups evolve, the most common errors tend to surface by virtue of their repetition and lead to "fixes" in the reports themselves, usually along the lines of computations within the reports that only serve to obfuscate the source data.
Our Solution

Our solution is to leverage key data quality aspects of the transform procedures detailed by the Kimball Group for enterprise data warehousing solutions. This approach provides three key benefits:

- More robust data quality
- Integrity of the source system data
- A "glide path" toward the data warehouse
Data Quality Benefits

The basis of our solution lies in a metadata store of specific screens, each of which serves to quantify specific aspects of each data record. Screens can enforce column properties within each record, the structural relation of columns to each other, or logical business rules that check individual or aggregate data values. The upshot is a data quality score that is applied to each record.
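As a minimal sketch of what such a metadata store might look like, the following T-SQL defines a screen table and registers one column-property screen. The table, column, and screen names here are illustrative assumptions, not the actual schema of the baseline code:

    -- Illustrative sketch only: all names below are assumptions.
    CREATE TABLE dbo.QualityScreen (
        ScreenID      INT IDENTITY(1,1) PRIMARY KEY,
        ScreenName    VARCHAR(100)  NOT NULL,
        ScreenType    VARCHAR(20)   NOT NULL,  -- 'column', 'structural', or 'business rule'
        TargetTable   SYSNAME       NOT NULL,  -- source table the screen inspects
        TestSQL       NVARCHAR(MAX) NOT NULL,  -- predicate identifying failing rows
        ErrorSeverity INT           NOT NULL   -- weight deducted from a record's quality score
    );

    -- Example column-property screen: flag sales orders missing a region code.
    INSERT INTO dbo.QualityScreen
        (ScreenName, ScreenType, TargetTable, TestSQL, ErrorSeverity)
    VALUES
        ('MissingRegionCode', 'column', N'dbo.SalesOrder',
         N'RegionCode IS NULL', 10);

A record's quality score can then be derived by deducting the severities of every screen it fails.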
The added value is that data quality metrics are an authentic data source. They guide both report owners and producers to concentrate data cleansing efforts on the source systems where they belong.
Source System Data Integrity

Data integrity is preserved in a pristine state by virtue of the separation of data between the source systems and the QA screen metrics. Chiefly, the QA metrics take the form of an audit dimension whose columns can be either integrated into existing report queries or delivered separately in the resulting workbook or deck. This simple benefit guarantees one version of the truth while maintaining an informed level of trust that would otherwise be mixed into the reporting data stream.
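As a hedged illustration of how those audit columns might surface in a report query, the sketch below invents a dbo.DimAudit table and a sample join; the table, its columns, and the report query are assumptions for the example, not the paper's actual design:

    -- Hypothetical audit dimension: QA metrics live here, apart from
    -- the untouched source rows.
    CREATE TABLE dbo.DimAudit (
        AuditKey       INT IDENTITY(1,1) PRIMARY KEY,
        SourceTable    SYSNAME  NOT NULL,
        SourceRecordID INT      NOT NULL,
        QualityScore   INT      NOT NULL,  -- e.g., 100 minus accumulated screen severities
        ScreensFailed  INT      NOT NULL,
        LoadDate       DATETIME NOT NULL DEFAULT GETDATE()
    );

    -- An existing report query needs only one extra join to carry the
    -- quality metrics alongside the business measures.
    SELECT o.OrderID,
           o.OrderAmount,
           a.QualityScore,
           a.ScreensFailed
    FROM dbo.SalesOrder AS o
    JOIN dbo.DimAudit   AS a
      ON a.SourceTable    = N'dbo.SalesOrder'
     AND a.SourceRecordID = o.OrderID;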
Data Warehouse "Glide Path"

By implementing the accepted best practice for data quality in the data warehousing field, workgroups have armed themselves with metadata that is easily understood by data warehouse implementers. More importantly, they have purchased for themselves a "seat at the table" in future cost containment and report centralization efforts.

Implementation

Implementation of the solution is designed to fit into the existing workflow of a typical sales or marketing analytics team. Automation of the existing reports and the standard "what decisions do you make using this data" kinds of analysis form the normal weekly workflow. These efforts lead to the screen definitions.

This effort is actualized by the baseline data quality code within the Microsoft SQL Server Integration Services (SSIS) toolset. Once the codebase is in place, the screens are brought to bear and the key error and audit deliverables mature naturally over time.
or delivered separately in the
resulting workbook or deck.