T5  Test Data Management
5/11/17  9:45

Data Quality at the Speed of Work

Presented by:
Shauna Ayers
Catherine Cruz Agosto
Availity

Brought to you by:

350 Corporate Way, Suite 400, Orange Park, FL 32073
888-268-8770 · 904-278-0524 - info@techwell.com - http://www.starwest.techwell.com/
 
	
  	
  
	
  
Shauna Ayers

Shauna Ayers has been untangling the Gordian knots of IT systems for more than seventeen years, analyzing data systems and testing both software and data quality in the manufacturing, medical device, and healthcare industries. Shauna found her passion in developing creative solutions for the analysis and testing of sensitive and highly regulated data sets at industry leaders such as Blue Cross Blue Shield of Florida (now Florida Blue), Vistakon (a subsidiary of Johnson & Johnson), and Availity.
	
  
Catherine Cruz Agosto

Catherine Cruz Agosto found that her software engineering experience at Baxter Healthcare and Boeing subsidiary Insitu provided an excellent foundation for finding more effective and user-friendly approaches to complex technical problems. Catherine has developed more efficient and innovative data quality testing solutions at healthcare intermediary Availity, expanding their automated data quality testing processes to accommodate diverse and dissimilar data sources, thus facilitating analysis, testing, and controls for data integration, analytics, and healthcare data reporting.
Data Quality at the Speed of Work

By Shauna Ayers and Catherine Cruz Agosto

Overview
•  Definitions
•  Why is this important?
•  What strategies can we use?
•  What benefits do these activities bring us?
•  What tools do we use?
•  Case Studies
•  Communication
•  Conclusion
	
  
Definitions
●  Data quality (DQ) is data's fitness and
usability for its intended purpose.
●  Data quality assurance is the monitoring
and analysis of data sets and the
processes that create or manipulate data,
in order to ensure the data’s quality meets
the company's needs.
●  DQ Issue: Incorrect or unexpected behavior from the data as a result of an unknown data scenario, upstream change, flaw in logic, missing requirements, etc.
○  Timing Issue: A type of issue/defect
in which the root cause stems from
the timing between two or more
components of the system that
depend on each other.
Why is this important?
•  Consumers expect data to be instantly available
•  Consumers expect near-zero downtime
•  Automation and algorithmic transactions cause a small
data issue to snowball quickly
•  If consumers don’t feel they can trust your data, they
won’t be your customers for long
	
  
What strategies can we use?
●  Types of Testing
○  Exploratory
○  Manual
○  Automated
●  Continuous Regression
○  Production Monitoring
vs Monitoring Lower
Environments
●  Continuous Data Profiling
What strategies can we use? (continued)
●  Types of Checks and how to use them to identify timing issues
○  Business Rule Validations: Type of test that verifies all of the acceptance criteria by comparing the source data to the target data.
■  This type of check catches any discrepancies or deviations from the acceptance criteria.
○  Null Checks: Type of test that verifies key fields are not null.
■  Verifies that fields expected to be populated are filled on the initial write, rather than by a later update.
○  Duplicate Checks: Type of test that checks for any unexpected duplication of records, typically by use of an alternate key.
■  Can be used to spot duplications that are created over time.
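The null and duplicate checks above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the in-memory list of dicts and the field names (`claim_id`, `member_id`) are hypothetical stand-ins for real query results.

```python
# Minimal sketches of a null check and a duplicate check over an
# in-memory data set. Field names are hypothetical examples.
from collections import Counter

def null_check(rows, key_fields):
    """Return rows where any key field is missing or None."""
    return [r for r in rows if any(r.get(f) is None for f in key_fields)]

def duplicate_check(rows, alternate_key):
    """Return alternate-key values that appear more than once."""
    counts = Counter(tuple(r.get(f) for f in alternate_key) for r in rows)
    return [k for k, n in counts.items() if n > 1]

rows = [
    {"claim_id": 1, "member_id": "A"},
    {"claim_id": 2, "member_id": None},   # fails the null check
    {"claim_id": 1, "member_id": "B"},    # duplicate claim_id
]
print(null_check(rows, ["member_id"]))      # -> [{'claim_id': 2, 'member_id': None}]
print(duplicate_check(rows, ["claim_id"]))  # -> [(1,)]
```

Run on a schedule against production extracts, the same two functions support the continuous-regression use described earlier.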
What strategies can we use? (continued)
●  More types of Checks and how to use them
○  Environment Checks: Type of test that verifies the process run is within tolerance.
■  Can be used to identify if and when a process is running behind, which can explain data issues in downstream processes.
○  Count Checks: Type of test that compares the count of records in the source to the count of records in the target.
■  A timing issue could be a potential cause of a count mismatch.
○  Compare Checks: Type of test that compares the alternate keys of records in the source to the alternate keys of records in the target.
■  A mismatch in data could indicate a potential timing issue.
■  Can use a compare check to get the details behind a count check discrepancy.
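The count/compare pairing above can be sketched as follows; a hedged example, assuming source and target are each represented by a set of alternate-key values (here invented keys `K1`..`K3`), with the compare check used to detail the count discrepancy:

```python
# Sketch of count and compare checks between a source and a target.
# The key sets are hypothetical stand-ins for real query results.
def count_check(source_keys, target_keys):
    """Pass when source and target hold the same number of records."""
    return len(source_keys) == len(target_keys)

def compare_check(source_keys, target_keys):
    """Detail a count discrepancy: keys missing from either side."""
    return {
        "missing_in_target": sorted(set(source_keys) - set(target_keys)),
        "missing_in_source": sorted(set(target_keys) - set(source_keys)),
    }

source = {"K1", "K2", "K3"}
target = {"K1", "K2"}          # K3 not yet loaded -- possibly a timing issue
if not count_check(source, target):
    print(compare_check(source, target))
```

A record missing only from the target, with no other deviation, is the classic signature of a load that lags its source.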
What strategies can we use? (continued)
●  Even more types of checks and how to use them
○  Domain Integrity Checks: Type of test that verifies the values used in a specified field exist in the corresponding code set.
■  Could indicate a discrepancy between when a value was added to the code set and when the code value was used.
○  System Version Checks: Type of test that detects changes to the version the system is running on.
■  Changes and/or updates to system versions can cause unexpected issues such as differences in process behavior, differences in system clocks, etc.
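A domain integrity check reduces to set membership against the controlled code set. A minimal sketch, with a made-up status code set standing in for real reference data:

```python
# Sketch of a domain integrity check: flag field values that do not
# exist in the corresponding code set. Codes here are hypothetical.
def domain_check(values, code_set):
    """Return the distinct observed values not found in the code set."""
    return sorted(set(values) - set(code_set))

status_codes = {"A", "D", "P"}            # controlled reference data
observed = ["A", "P", "X", "A", "X"]      # "X" is not (yet?) in the code set
print(domain_check(observed, status_codes))   # -> ['X']
```

An unexpected value like `X` may be bad data, or it may be a valid code used before the reference load caught up, which is exactly the timing discrepancy described above.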
What benefits do these activities bring us?
•  Opportunity to fix issues before the customer sees or
reports them
•  Faster localization of root causes
•  Better visibility of chronic issues rooted in timing and
environment
•  Better visibility of changes in input profiles
•  Cleaner integration with existing operational support
	
  
What tools do we use?
●  Buying DQ testing software
o  Common tools: Informatica Data Quality, Datamartist, Microsoft Data Profiling Task
o  All tools have limitations of some sort
o  Can get expensive
●  Creating custom test harnesses
o  Seems more time-consuming up-front
o  More control and fewer limitations than off-the-shelf tools
●  A machine cannot replace a human
Case Studies: Data Integration Timing
●  Definition: The timing of ETL processes in relation to each other and the
supporting systems they depend on. Risks affect execution order,
dependencies, and load rule boundaries across processes.
●  Useful Checks:
o  Count/ Compare checks
o  Tolerance/Threshold checks
(includes cycle time checks)
o  Environment checks
o  Business Rule Validations
●  Case Studies
o  Hybrid systems – the
velocity/dependency trap
o  Clock syncs sink ships
o  Who watches the watchmen?
o  Surge Protection
	
  
Case Studies: Operational Dependencies
●  Definition: Two or more
processes of a system or
components of a process that
rely on each other.
●  Useful Checks:
○  Codesets
○  BRV
○  Null Checks
○  System Version Checks
○  Count/ Compare checks
○  Environment Checks
●  Case Studies
○  Rocket Failure
○  Data Warehousing
○  UI to Backend
Case Studies: Reference Data Management
●  Definition: Reference values are used to drive categorization, routing, and filtering, and may provide part of the focus for dimensional data. They are normally controlled data sets.
●  Useful Checks:
o  Domain checks
o  Tolerance/Threshold checks
o  Consistency checks
●  Case Studies
o  Point-of-Use Domain Checks
o  Rate of Dimensional Growth (runaway conditions in the content)
o  Process violations
	
  
	
  
Case Studies: Data Integrity
●  Definition: The correctness of the data in, or output from, the system
●  Useful Checks:
o  BRV
o  Null Checks
o  Domain Checks
o  Duplicate Checks
o  Count/ Compare Checks
o  Environment Checks
●  Case Studies
o  Transaction Processing
o  Reporting
Communication: Proactive Notification Alerts
•  Automated notification mechanisms can be integrated easily with existing operational alert mechanisms (e.g., PagerDuty)
•  Notifications and alerts can be tailored to support and reinforce data stewardship
	
  
Communication: Business Intelligence Dashboards
●  External Dashboards
○  Potential Users: Customers, Production Support, Customer Service,
Business
●  Internal Dashboards
○  Display more granular data regarding processes and/ or tests
○  Drill-through
Communication: Trends Analysis
●  Performance and tolerance checks over time reveal cyclic impacts from maintenance activities, or correlate surges in quality issues with specific business activities. These findings drive preventive measures, capacity planning, and performance tuning.
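One way to operationalize trend analysis is to track a process's cycle time across runs and flag any run that drifts beyond a tolerance band around the trailing average. A sketch with made-up run times; the window and tolerance factor are illustrative assumptions, not recommended values.

```python
# Sketch of a trend/tolerance check: flag cycle times exceeding the
# trailing average by a tolerance factor. Numbers are made up.
def flag_slow_runs(cycle_times, window=5, tolerance=1.5):
    """Return indices of runs slower than tolerance * trailing mean."""
    flagged = []
    for i in range(window, len(cycle_times)):
        baseline = sum(cycle_times[i - window:i]) / window
        if cycle_times[i] > tolerance * baseline:
            flagged.append(i)
    return flagged

runs = [10, 11, 10, 12, 11, 10, 25, 11]   # run 6 surges to 25 minutes
print(flag_slow_runs(runs))               # -> [6]
```

Plotting the same baseline over weeks is what surfaces the cyclic maintenance impacts mentioned above, rather than just single-run outliers.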
	
  
Conclusion

●  Proactive data quality saves an organization time and money.
●  Data is the fastest-changing element of an organization; there is no cookie-cutter way of monitoring or testing, but there are known strategies that can help navigate the course.
●  Metadata about data quality testing can be used to communicate issues faster, target the correct parties more easily, and provide insights into the health of the systems that drive the organization.
Questions?
