In this fast-paced data-driven world, the fallout from a single data quality issue can cost thousands of dollars in a matter of hours. To catch these issues quickly, system monitoring for data quality requires a different set of strategies from other continuous regression efforts. Like a race car pit crew, you need detection mechanisms that not only don’t interfere with what you are monitoring but also allow for strategic analysis off-track. You need to use every second your subject is at rest to repair and clean up problems that could affect performance. As the systems in race cars vary, the tools and resources available to the data quality professional vary from one organization to the next. You need to be able to leverage the tools at hand to implement your solutions. Shauna Ayers and Catherine Cruz Agosto show you how to develop testing strategies to detect issues with data integration timing, operational dependencies, reference data management, and data integrity—even in production systems. See how you can leverage this testing to provide proactive notification alerts and feed business intelligence dashboards to communicate the health of your organization’s data systems to both operation support and non-technical personnel.
Data Quality at the Speed of Work
1. T5 Test Data Management
5/11/17 9:45
Data Quality at the Speed of Work
Presented by: Shauna Ayers and Catherine Cruz Agosto, Availity
Brought to you by:
350 Corporate Way, Suite 400, Orange Park, FL 32073
888-268-8770 · 904-278-0524 · info@techwell.com · http://www.starwest.techwell.com/
2. Shauna Ayers
Shauna Ayers has been untangling the Gordian knots of IT systems for more than seventeen years, analyzing data systems and testing both software and data quality in the manufacturing, medical device, and healthcare industries. Shauna found her passion in developing creative solutions for the analysis and testing of sensitive and highly regulated data sets at industry leaders such as Blue Cross Blue Shield of Florida (now Florida Blue), Vistakon (a subsidiary of Johnson & Johnson), and Availity.
Catherine Cruz Agosto
Catherine Cruz Agosto found her software engineering experience at Baxter Healthcare and Boeing subsidiary Insitu provided an excellent foundation for finding more effective and user-friendly approaches to complex technical problems. Catherine has developed more efficient and innovative data quality testing solutions at healthcare intermediary Availity, expanding their automated data quality testing processes to accommodate diverse and dissimilar data sources, thus facilitating analysis, testing, and controls for data integration, analytics, and healthcare data reporting.
3. Data Quality at the Speed of Work
By Shauna Ayers and Catherine Cruz Agosto
4. Overview
• Definitions
• Why is this important?
• What strategies can we use?
• What benefits do these activities bring us?
• What tools do we use?
• Case Studies
• Communication
• Conclusion
5. Definitions
● Data quality (DQ) is data's fitness and
usability for its intended purpose.
● Data quality assurance is the monitoring
and analysis of data sets and the
processes that create or manipulate data,
in order to ensure the data’s quality meets
the company's needs.
● DQ Issue: Incorrect or unexpected
behavior from the data as a result of an
unknown data scenario, upstream change,
flaw in logic, missing requirements, etc.
○ Timing Issue: A type of issue/defect
in which the root cause stems from
the timing between two or more
components of the system that
depend on each other.
6. Why is this important?
• Consumers expect data to be instantly available
• Consumers expect near-zero downtime
• Automation and algorithmic transactions cause a small
data issue to snowball quickly
• If consumers don’t feel they can trust your data, they
won’t be your customers for long
7. What strategies can we use?
● Types of Testing
○ Exploratory
○ Manual
○ Automated
● Continuous Regression
○ Production Monitoring
vs Monitoring Lower
Environments
● Continuous Data Profiling
8. What strategies can we use? (continued)
● Types of Checks and how to use them to identify timing issues
○ Business Rule Validations: Type of test that verifies all of the
acceptance criteria by comparing the source data to the target
data.
■ This type of check catches any discrepancies or deviations
from the acceptance criteria.
○ Null Checks: Type of test that verifies key fields are not null
■ Verify that fields expected to be populated are filled on the
initial write, rather than by a later update.
○ Duplicate Checks: Type of test that checks for any unexpected
duplication of records, typically by use of alternate key.
■ Can be used to spot duplications that are created over time.
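The null and duplicate checks above can be sketched in a few lines of Python over an extracted batch of records. A minimal sketch; the `member_id`/`plan` fields and the sample batch are hypothetical, not part of the original deck.

```python
def null_check(records, required_fields):
    """Return records whose required fields are missing or None."""
    return [r for r in records
            if any(r.get(f) is None for f in required_fields)]

def duplicate_check(records, alt_key_fields):
    """Return alternate-key tuples that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        key = tuple(r.get(f) for f in alt_key_fields)
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes

# Hypothetical batch with one missing value and one duplicated alternate key
batch = [
    {"member_id": 1, "plan": "A"},
    {"member_id": 2, "plan": None},  # caught by the null check
    {"member_id": 1, "plan": "B"},   # duplicate alternate key (member_id)
]
print(null_check(batch, ["plan"]))            # -> [{'member_id': 2, 'plan': None}]
print(duplicate_check(batch, ["member_id"]))  # -> {(1,)}
```

Running the same functions on successive snapshots of a table is how duplications "created over time" surface: a key absent from yesterday's duplicate set but present in today's points at the window in which the duplication occurred.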
9. What strategies can we use? (continued)
● More types of Checks and how to use them
○ Environment Checks: Type of test that verifies the process run is
within tolerance.
■ Can be used to identify if and when a process is running behind,
which can explain data issues in downstream processes.
○ Count Checks: Type of test that compares the count of records in
the source to the count of records in the target.
■ A timing issue could be a potential cause of a count mismatch.
○ Compare Checks: Type of test that compares the alternate keys of
records in the source to the alternate keys of records in the target.
■ A mismatch in data could indicate a potential timing issue.
■ Can use a compare check to get the details on a count check
discrepancy.
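The count and compare checks above pair naturally: the count check raises the flag, and the compare check names the offending keys. A minimal sketch, assuming the alternate keys have already been extracted into lists; the `K1`..`K3` keys are made up for illustration.

```python
def count_check(source_rows, target_rows):
    """True when source and target hold the same number of records."""
    return len(source_rows) == len(target_rows)

def compare_check(source_keys, target_keys):
    """Report alternate keys missing on either side of the load."""
    src, tgt = set(source_keys), set(target_keys)
    return {"missing_in_target": src - tgt,
            "extra_in_target": tgt - src}

source = ["K1", "K2", "K3"]
target = ["K1", "K2"]  # K3 not yet loaded -- a possible timing issue
print(count_check(source, target))    # -> False
print(compare_check(source, target))  # -> {'missing_in_target': {'K3'}, 'extra_in_target': set()}
```

If the missing keys all cluster near the load-window boundary, that is the timing signature the slide describes; keys scattered at random point instead at a logic or integrity defect.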
10. What strategies can we use? (continued)
● Even more types of checks and how to use them
○ Domain Integrity Checks: Type of test that verifies the values
used in a specified field exist in the corresponding code set.
■ Could indicate a discrepancy between the timing of a value being
added to the code set and the use of that code value.
○ System Version Checks: Type of test that detects changes to the
version the system is running on.
■ Changes and/or updates to system versions can cause
unexpected issues such as differences in process behavior,
differences in system clocks, etc.
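A domain integrity check reduces to a set difference between the values observed in a field and the reference code set. A minimal sketch; the claim-status records and the code set below are hypothetical examples.

```python
def domain_integrity_check(records, field, code_set):
    """Return values used in `field` that are absent from the reference code set."""
    return {r[field] for r in records} - set(code_set)

# Hypothetical claim records checked against a controlled status code set
claims = [{"status": "PAID"}, {"status": "DENIED"}, {"status": "PENDNG"}]
valid_statuses = {"PAID", "DENIED", "PENDING"}
print(domain_integrity_check(claims, "status", valid_statuses))  # -> {'PENDNG'}
```

An empty result set means every value is backed by the code set; a non-empty one either exposes a bad value (as here) or, per the slide, a timing gap where the data arrived before the code set was updated.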
11. What benefits do these activities bring us?
• Opportunity to fix issues before the customer sees or
reports them
• Faster localization of root causes
• Better visibility of chronic issues rooted in timing and
environment
• Better visibility of changes in input profiles
• Cleaner integration with existing operational support
12. What tools do we use?
● Buying DQ testing software
o Common tools: Informatica
Data Quality, Datamartist,
Microsoft Data Profiling Task
o All tools have some sort of
limitations
o Can get expensive
● Creating custom test harnesses
o Seems more time-consuming
up-front
o More control and fewer limitations
compared to off-the-shelf tools
● A machine cannot replace a human
13. Case Studies: Data Integration Timing
● Definition: The timing of ETL processes in relation to each other and the
supporting systems they depend on. Risks affect execution order,
dependencies, and load rule boundaries across processes.
● Useful Checks:
o Count/ Compare checks
o Tolerance/Threshold checks
(includes cycle time checks)
o Environment checks
o Business Rule Validations
● Case Studies
o Hybrid systems – the
velocity/dependency trap
o Clock syncs sink ships
o Who watches the watchmen?
o Surge Protection
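The tolerance/threshold (cycle time) check listed above can be sketched as a simple elapsed-time comparison against a per-process tolerance. The timestamps and the 45-minute tolerance below are illustrative assumptions, not values from the deck.

```python
from datetime import datetime, timedelta

def cycle_time_check(started, finished, tolerance_minutes):
    """Flag an ETL run whose cycle time exceeds its tolerance.

    Returns (within_tolerance, elapsed) so the alert can report by how much
    the run overshot its window.
    """
    elapsed = finished - started
    return elapsed <= timedelta(minutes=tolerance_minutes), elapsed

# Hypothetical run: 50 minutes elapsed against a 45-minute tolerance
ok, elapsed = cycle_time_check(
    datetime(2017, 5, 11, 9, 0),
    datetime(2017, 5, 11, 9, 50),
    tolerance_minutes=45,
)
print(ok)  # -> False
```

A failed cycle-time check on an upstream feed is often the root cause behind count and compare mismatches downstream, which is why these checks are grouped together for integration-timing cases.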
14. Case Studies: Operational Dependencies
● Definition: Two or more
processes of a system or
components of a process that
rely on each other.
● Useful Checks:
○ Codesets
○ BRV
○ Null Checks
○ System Version Checks
○ Count/ Compare checks
○ Environment Checks
● Case Studies
○ Rocket Failure
○ Data Warehousing
○ UI to Backend
15. Case Studies: Reference Data Management
● Definition: Reference values are used to drive categorization, routing and
filtering, and may provide part of the focus for dimensional data. They are
normally controlled data sets.
● Useful Checks:
o Domain checks
o Tolerance/Threshold checks
o Consistency checks
● Case Studies
o Point-of-Use Domain Checks
o Rate of Dimensional Growth (runaway conditions in the content)
o Process violations
16. Case Studies: Data Integrity
● Definition: The correctness of
the data in, or output from, the
system.
● Useful Checks:
o BRV
o Null Checks
o Domain Checks
o Duplicate Checks
o Count/ Compare Checks
o Environment Checks
● Case Studies
o Transaction Processing
o Reporting
17. Communication: Proactive Notification Alerts
• Automated notification mechanisms can be integrated easily with existing
operational alert mechanisms (e.g., pager duty)
• Notifications and alerts can be tailored to support and reinforce data
stewardship
18. Communication: Business Intelligence Dashboards
● External Dashboards
○ Potential Users: Customers, Production Support, Customer Service,
Business
● Internal Dashboards
○ Display more granular data regarding processes and/ or tests
○ Drill-through
19. Communication: Trends Analysis
● Performance and tolerance
checks over time reveal cyclic
impacts from maintenance
activities or correlation of
surges in quality issues to
specific business activities.
These drive preventive
measures, capacity planning
and performance tuning.
20. Conclusion
● Proactive data quality saves an
organization time and money.
● Data is the fastest changing
element of an organization; there
is no cookie cutter way of
monitoring or testing, but there
are known strategies that can be
used to help maneuver the
course.
● Metadata about data quality
testing can be used to
communicate issues faster, more
easily target the correct parties,
and provide insights as to the
health of the systems that drive
the organization.