3. Proprietary
Abstract: DQ Fundamentals
• Organizations today get value from their data in the face of challenging odds.
• Optimal management of traditional data requires a wide skillset and a strategic perspective.
• Changes in technology have increased the volume, velocity, and variety of data, but many
organizations do not yet have a handle on veracity in traditional data management
environments, never mind big data environments.
• And, while big data is on the rise, more traditional forms of data are not going away. Instead,
different kinds of data will co-exist and must be managed in conjunction with one another.
• This tutorial will revisit the fundamentals of data quality management in the light of big data and
explore how to apply them in traditional and big data environments.
• Participants will learn how to assess the current state of their data environment and deliver
more reliable data to their stakeholders.
4. Proprietary
About me
Data quality practitioner in the health care industry since 2003
Background in banking, manufacturing / distribution, commercial insurance, and academia
Publications
– Author, Navigating the Labyrinth: An Executive Guide to Data Management (2019)
– Production Editor, DAMA Data Management Body Of Knowledge second edition, [DMBOK2] (2017)
– Author, Measuring Data Quality for Ongoing Improvement (2013)
Service
– Advisor, DAMA New England, 2019 - present
– DAMA Publications Officer, 2015 – 2019
– IAIDQ (now IQ International) Member Director, 2010-12
Recognition
– DAMA International Recognition for Outstanding Contributions to Data Management, 2019
– DAMA New England Award for Excellence in Data Management, 2019
– IAIDQ (now IQ International) Distinguished Member Award, 2015
6. Proprietary
Abstract and Agenda
Abstract:
Organizations today get value from their data in the face of
challenging odds. Optimal management of traditional data
requires a wide skillset and strategic perspective.
Changes in technology have increased the volume, velocity, and
variety of data, but many organizations do not yet have a handle
on veracity in traditional data management environments, never
mind big data environments. And, while big data is on the rise,
more traditional forms of data are not going away. Instead,
different kinds of data will co-exist and must be managed in
conjunction with one another.
This tutorial will revisit the fundamentals of data quality
management in the light of big data and explore how to apply
them in traditional and big data environments. Participants will
learn how to assess the current state of their data environment
and deliver more reliable data to their stakeholders.
Agenda
• Introductions
• Quality management concepts and principles
• Applying quality management to traditional data
• The role of measurement and monitoring
• Big Data challenges
• Data Quality Practices for Big Data and Little Data
8. Proprietary
Why DQ Management Matters: Poor quality data cost money
• Reports differ, but many estimate that between 10% and 30% of productivity is lost due to poor
quality data.
• This estimate may even be low: one report indicated that data scientists spend 60% of their time
cleansing data.
• IBM estimated that data quality problems cost the US $3 Trillion in 2016.
[Pie chart: 10-30% of productivity is lost due to poor quality data (30% unproductive vs. 70% productive time)]
[Pie chart: data scientists' time: 60% spent cleansing data vs. 40% spent analyzing data]
10. Proprietary
Definition of Quality: Fitness for Purpose / Fitness for Use
Data Quality: A measure of the degree to which data is fit for
the purposes of the people, processes, and systems that use
the data.
The concept of “fit for purpose” directly relates data quality to
the quality of manufactured products.
Data is a product, NOT a by-product.
“Fit for Purpose” also relates data quality to the concept of a data
consumer – a person, process, or a system that uses data.
Data Quality Management: A set of activities intended to
ensure that data is fit for use by data consumers.
11. Proprietary
Manufacturing: A brief history of mass-produced products
19th Century Industrial Revolution:
• Steam power
• Interchangeable parts
• Development of large factories
• Production line manufacturing processes
• Machine tooling
20th Century Mass Production:
• Machine tooled interchangeable parts
• Assembly line
• Vertical integration of the manufacturing process
• Quality control
13. Proprietary
Pioneers of Quality Control
• Defined criteria for quality based on customer
expectations
• Recognized the relation between a well-defined
process and a predictable outcome
• Used measurement to manage and improve
processes
• Created tools to assess and improve product
quality
• Recognized that producing a quality product
requires life cycle management, supply chain
management, and leadership commitment
14. Proprietary
Quality Control in Manufacturing: Product and Process
A process is a series of steps that turn inputs into outputs.
• The better the quality of the inputs, and the better defined the steps, the better the quality of the outputs.
Add to this the idea that the execution of processes can be improved through observation, analysis, and
feedback.
The more consistent the input and the more consistently the process is executed, the more consistent the
result.
15. Proprietary
Quality and the Customer
Thought leaders in Quality Control / Quality improvement recognize that there is a customer at the end of the
assembly line: Someone wants to buy the product.
That person has expectations at two levels:
• At the very least, the Product must perform its primary function.
• Ideally, the Product also pleases the customer in some way.
Dimensions of Product Quality (from David Garvin)
– Performance: The product operates as expected.
– Features: The product has additional characteristics that please the customer.
– Reliability: The product works well. The customer can count on it.
– Conformance: The product meets standards.
– Durability: The product lasts for an expected amount of time.
– Serviceability: If the product breaks it can be fixed.
– Aesthetics: The product is attractive and pleasing.
– Perceived Quality: The customer feels good about the product.
16. Proprietary
Intention and quality: Quality is not accidental
Source: Kaizen institute of India.
https://kaizeninstituteindia.wordpress.com/2013/10/08/quality-is-
not-an-act-it-is-a-habit/
18. Proprietary
The Role of Measurement in Quality Control
Statistical process control – a means to
measure the consistency of processes
Measurement formalizes expectations
Monitoring ensures unexpected
variation within the system is detected
20. Proprietary
Definition: Data Quality Management
Data Quality: A measure of the degree to which data is fit for the purposes of the people and
systems that use the data.
Data Quality Management: A set of activities intended to ensure that data is fit for purpose,
including:
• Data quality assessment
• Data quality requirements definition
• Data quality monitoring
• Data issue detection
• Issue remediation
• Reporting on data quality
• Improving business and technical processes to ensure data is of high quality
• What you mean by high quality data
• How you detect low quality data
• What you do about low quality data
All data management processes have the potential to impact the fitness of data for use. Not every process needs to be called a "data quality" process.
Core Data Quality processes have foundational, project-oriented, and operational components.
22. Proprietary
Data as the Product of a Process
Process: A process is a series of steps that turn inputs into outputs
23. Proprietary
Data as the Product of a Process
DQ problems are usually detected in data output, but those problems can be caused at any point in the production or consumption process.
Data Quality is understood in terms of outputs
• Expected outputs = Good Quality Data
• Unexpected outputs = Poor Quality Data
25. Proprietary
Intention: Data Quality Improvement via PDCA
The same processes that are applied to improve the
quality of manufactured products can be applied to
improve the quality of data.
Different improvement methodologies use essentially
the same process.
• Six Sigma
• Lean
• Total Quality Management
29. Proprietary
Data Life Cycle Management
Adapted from Danette McGilvray,
Executing Data Quality Projects: Ten
Steps to Quality Data and Trusted
Information
30. Proprietary
Manage Data Quality throughout the Data Life Cycle
Managing quality
throughout the data life
cycle requires
• Data Governance
• Metadata Management
Adapted from Danette McGilvray,
Executing Data Quality Projects: Ten
Steps to Quality Data and Trusted
Information
32. Proprietary
Dimensions of Data Quality – Why they matter
• Data quality dimensions function in the way that length,
width, and height function to express the size of a physical
object.
• They allow understanding of quality in relation to a scale
and in relation to other data measured against the same
scale.
• Data quality dimensions can be used to define
expectations (the standards against which to measure) for
the quality of a desired dataset, as well as to measure the
condition of an existing dataset.
• Dimensions provide an understanding of why we measure
(what question a measurement answers). For example, to
understand the level of completeness, validity, and integrity of
data.
• Dimensions also help us identify things that we cannot
measure or that there is little value in measuring.
34. Proprietary
Dimensions of Data Quality
A Dimension of Data Quality is a characteristic of data that can
be measured and through which data quality can be quantified.
There are many frameworks that define DQ dimensions, and no
single agreed-to set. However, all account for similar concepts,
which have a common-sense meaning.
• COMPLETENESS: You have all the pieces of data you need or expect to
have.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
• VALIDITY: Data values belong to the set of possible (expected) values.
• INTEGRITY: Different pieces of data relate to each other in the ways you
expect them to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
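As a sketch of how these dimensions translate into executable checks, the following applies completeness, conformity, and validity tests to a single record. The field names, date pattern, and state domain are illustrative assumptions, not rules from this tutorial.

```python
import re

# Sketch only: record fields, the date pattern, and the state list are invented.
record = {"member_id": "A123", "birth_date": "1985-07-14", "state": "ZZ"}

VALID_STATES = {"MA", "NH", "VT", "ME", "RI", "CT"}   # assumed value domain
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")     # assumed format rule

def check_record(rec):
    """Return pass/fail findings keyed by (field, dimension)."""
    findings = {}
    # COMPLETENESS: every expected field is populated
    for field in ("member_id", "birth_date", "state"):
        findings[(field, "completeness")] = bool(rec.get(field))
    # FORMAT CONFORMITY: the date is in the form we expect
    findings[("birth_date", "conformity")] = bool(
        DATE_PATTERN.match(rec.get("birth_date", ""))
    )
    # VALIDITY: the value belongs to the set of expected values
    findings[("state", "validity")] = rec.get("state") in VALID_STATES
    return findings

results = check_record(record)
# "ZZ" is populated (complete) but outside the domain (invalid)
```

Note how the same field can pass one dimension and fail another, which is exactly why the dimensions are measured separately.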
36. Proprietary
Logical Relationship between DQ Dimensions
• COMPLETENESS: You have all the pieces of data you need or expect to have.
If you do not have all the data you need, then other measurements of quality may not even matter.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect.
• VALIDITY: Data values belong to the set of possible (expected) values.
If the data values are not in the allowed set of values, then they cannot be correct.
• INTEGRITY: Different pieces of data relate to each other in the ways you expect them to.
If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way
you intended to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
If the data does not follow expected patterns, then you will want to understand why (is it a change in the pattern
or an error in the data?)
37. Proprietary
Using DQ Dimensions to Create Standards & Rules
Standard: A level of quality or attainment (a high standard for customer
service); an idea or thing used as a measure, norm, or model in comparative
evaluations (e.g., ISO standards)
Rule: one of a set of explicit or understood regulations or principles
governing conduct within a specific activity or sphere (e.g., Robert's Rules of
Order); a principle that operates within a particular sphere of knowledge,
describing or prescribing what is possible or allowable.
Business Rule: A business rule is a rule that defines or constrains some aspect
of business and always resolves to either true or false. Business rules are
intended to assert business structure or to control or influence the behavior of
the business.
STANDARDS represent an intersection of Data Quality Management and Data
Governance. Standards are also a form of metadata. DQ uses them to
measure, but they are also simply useful to explain expectations.
38. Proprietary
Benefits of Rules
• Common vocabulary – People understand expectations
in a similar way
• Consensus – People agree to the same things.
• Differences – People can also disagree about rules.
Rules provide a means of surfacing and therefore
clarifying different expectations.
• Simplicity – People make decisions once
• Predictability – People know what is expected of them
and they try to achieve it
• And with data… they give us a way to talk more
objectively about quality
39. Proprietary
Using DQ Dimensions to Create Standards & Rules
Dimensions of quality are the foundation of a common
vocabulary through which to articulate expectations
for quality. They can be used to:
• Create standards and rules & controls for models
and applications
• Establish measurements
• Report problems consistently
For example, completeness can be understood at
several levels: system, data set, record, field.
| Dimension | Data Object | Rule |
| --- | --- | --- |
| Completeness | Data Set | The number of [distinct entity] must be equal to the number of [distinct entity] in [Source] |
| Completeness | Data Set | [Amount field] must reconcile to [Amount field] in [Source] |
| Completeness | Field | Must be populated |
| Completeness | Field | Must be populated; standard default value allowed |
| Completeness | Optional fields, with population rules | Must be populated when … |
| Completeness | Optional fields, with population rules | Must be populated except when …; Must NOT be populated when … |
40. Proprietary
Logic for Field Level Completeness
Dimensions enable you to
consistently describe the
characteristics you are looking
for.
Many rules can be defined
through a logical progression of
questions related to the
dimension.
Here is an example focused on
completeness at the field level.
Similar questions could be asked
at the system or file level.
41. Proprietary
Logic for Format Conformity
As with completeness, with conformity,
we can establish a logical progression
of questions to define expectations at
the field level.
Some fields have one-and-only-one
acceptable format.
More complex fields may have a set of
format requirements.
Others are constrained only by data
type and format may not indicate much
about the quality of data.
42. Proprietary
Logic for Validity
The word validity is used to refer in a
general sense to whether or not the data is
“good”.
As a dimension of quality, it refers to
whether values are part of a defined
domain.
Validity Rules can be based on how the
domain of values is defined.
43. Proprietary
Sample Rule Syntax – Validity for Codified Data
Working through the decision tree
results in a standard syntax for
expressing rules.
Rules can be used to:
• Clarify expectations about quality
• Measure data quality
• Report on data quality
Depending on how much we know
about the data, they can also be used to
transform data. For example, to
populate a consistent default value for
all invalid values.
| Dimension | Data Subset | Rule | Meaning |
| --- | --- | --- | --- |
| Validity | Codified data | Valid values are limited to: [List of valid values] | Specifies a list of values that are valid. All other values are invalid. |
| Validity | Codified data | Values must exist in [code table / column …] | Specifies the code table and the column in the code table in which valid values are stored. All other values are invalid. |
| Validity | Codified data | The range of valid values is between: [MIN] and [MAX] | Provides the MIN and MAX value for the range. Any values outside of the MIN/MAX are invalid. |
| Validity | Codified data | Invalid values include: [List of invalid values] | Specifies a list of values that are not valid. All other values are valid. |
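The rule syntax above lends itself to simple predicate builders. This hypothetical sketch turns each of the four validity rule patterns into a reusable check; the rule-type names and parameters are invented shorthand for illustration.

```python
def make_validity_check(rule_type, param):
    """Build a validity predicate from one of the four rule patterns above.
    The rule_type strings are invented shorthand, not a standard syntax."""
    if rule_type == "valid_list":      # Valid values are limited to: [...]
        allowed = set(param)
        return lambda v: v in allowed
    if rule_type == "code_table":      # Values must exist in [code table / column]
        lookup = set(param)            # param stands in for the code-table column
        return lambda v: v in lookup
    if rule_type == "range":           # Range of valid values between MIN and MAX
        lo, hi = param
        return lambda v: lo <= v <= hi
    if rule_type == "invalid_list":    # Invalid values include: [...]
        blocked = set(param)
        return lambda v: v not in blocked
    raise ValueError(f"unknown rule type: {rule_type}")

# Hypothetical usage
is_valid_status = make_validity_check("valid_list", ["A", "I", "P"])
is_valid_score = make_validity_check("range", (0, 100))
```

Because each rule resolves to true or false, the same predicates can be used to clarify expectations, to measure, and to report.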
44. Proprietary
Using DQ Dimensions to Create Good Measurements
Characteristics of Good Measurements
• Meaningful: They are focused on characteristics that are important.
» Think: taking a child’s temperature
• Comprehensible: They present information that people can understand.
» Think: understanding how to read a thermometer
• Actionable: They allow people to make a decision or take an action.
» Think: knowing what to do when the temperature is higher than normal
Dimensions help with all of these
• Meaningful: They are focused on characteristics that are important
– They define what = GOOD Quality and what = BAD Quality
• Comprehensible: They present information that people can understand.
– They allow people to understand what is RIGHT or WRONG with the data – it is incomplete, invalid, etc.
• Actionable: They allow people to make a decision or take an action.
– They allow people to decide whether or not the data they want to use is FIT FOR PURPOSE
45. Proprietary
Sample Data Quality Standards
| Data Element | Completeness | Conformity | Validity | Integrity |
| --- | --- | --- | --- | --- |
| Date of Birth | Must be populated | Must conform to the date format requirements of the system in which it is present | Cannot be a future date; person cannot be older than 120 years (based on current date) | All occurrences of records for an individual should have the same date of birth |
| Birth Gender | Must be populated | See Validity | Values must exist in [list OR code table / column …] | All occurrences of records for an individual should have the same birth gender |
| Gender Identity | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | All occurrences of records for an individual within a time frame should have the same gender identity |
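The Date of Birth standards in this table can be expressed as one small check routine. This is a sketch that assumes an ISO yyyy-mm-dd format; a real system would substitute its own format requirements.

```python
from datetime import date, datetime

def check_birth_date(value, today=None):
    """Apply the Date of Birth standards above; ISO yyyy-mm-dd format assumed."""
    today = today or date.today()
    if not value:                    # Completeness: must be populated
        return "missing"
    try:                             # Conformity: must match the date format
        dob = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return "nonconforming"
    if dob > today:                  # Validity: cannot be a future date
        return "future_date"
    if today.year - dob.year > 120:  # Validity: cannot be older than 120 years
        return "implausible_age"
    return "ok"
```

The order of the checks mirrors the logical relationship between the dimensions: a missing or nonconforming value makes the validity checks moot.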
46. Proprietary
Sample Data Quality Standards
| Data Element | Completeness | Conformity | Validity | Integrity |
| --- | --- | --- | --- | --- |
| Ethnicity | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | Optional field - no integrity rules |
| Race | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | Optional field - no integrity rules |
| Marital Status | Situational -- required for some business processes | See Validity | Values must exist in [list OR code table / column …] | Situational |
| Relationship to subscriber | Must be populated | See Validity | Values must exist in [list OR code table / column …] | No rules identified; value can change over time. |
47. Proprietary
Applying DQ Dimensions – Lessons Learned
• The dimensions provided a new perspective on the data.
• Seeing the data separate from and in relation to systems
– What is optional in a system may be mandatory to a downstream process
• Translating common sense expectations about the data into consistent, objective criteria for reasonability
– Every person has a birth date and we can define a reasonable range for birth dates within a database
• Seeing gaps in expectations
– Marital status -- should everyone have a marital status, even a child? Or should only subscribers have this?
• Seeing that some concepts are not well-defined and may never be well-defined
– Race, ethnicity
• Some concepts that we once considered well defined are evolving
– Gender identity vs. birth gender
Despite the flux, we were able to come to consensus on our expectations – the dimensions provided a vocabulary to do so.
They allowed us to clarify expectations in a consistent manner.
48. Proprietary
Example Measurement
This measures the level of
completeness [MUST BE
POPULATED] of a critical field
It is very simple, because the
concept itself is very simple.
In many cases, you don’t need
more than this.
[Line chart: daily trend of field-level completeness, 27/12/2018 to 07/02/2019, y-axis from 0.0 to 0.9]
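A completeness trend like the one charted above takes only a few lines to produce. The batch dates and field values here are invented for illustration.

```python
# Sketch of a daily field-level completeness measurement for trending.
# Batches map a load date to the values of one critical field; data is invented.
daily_batches = {
    "2019-01-02": ["A1", "A2", None, "A4"],
    "2019-01-03": ["B1", None, None, "B4"],
}

def completeness_pct(values):
    """Share of values that are populated (non-null, non-empty)."""
    populated = sum(1 for v in values if v not in (None, ""))
    return populated / len(values)

trend = {day: completeness_pct(vals) for day, vals in daily_batches.items()}
# trend["2019-01-02"] -> 0.75; trend["2019-01-03"] -> 0.5
```

Plotting this series over time is what makes unexpected variation visible, which is the point of monitoring.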
52. Proprietary
Limitations of the Product Metaphor
The Challenges
• Data is not a physical product.
• Data is not tangible, but it is durable
• Data is easy to copy but very hard to reproduce from
scratch
• The same data can be used by multiple people and
processes at the same time
• Data is volatile
• The value of data changes based on context and
timing
• Using data often results in new data
Adapted from DMBOK2, chapter 1, which is adapted from Redman, Data Driven.
The Risks
An organization does not know what data it has
Data can be lost, breached, or misused
Data is replicated and variation is created between data
sets
The quality of data deteriorates over time or across
functions
Knowledge of data deteriorates within the organization
NOT JUST FITNESS FOR PURPOSE
Representational effectiveness: How well and
consistently data represents the concepts it is
intended to represent
Data Knowledge: How easily and well data
consumers can “decode” data
54. Proprietary
We don’t treat data as a product
• Production: Data comes from many places; very little control over the inputs
• Inventory: Organizations do not know what data they have, what condition it is in, what relation it has to the
processes that created it, etc.
• Storage: The ways that we store data have an impact on its quality, but we do not always account for this when we
work with data.
• Usage: Don’t know how data will be used – bring this into question. We do not recognize the connection between data
production and data uses
What would happen if we
treated physical products
the way that we treat data?
55. Proprietary
What is the Product? Who is the customer?
The challenges with the product concept are
related to how data evolves within the data
lifecycle and to different levels of awareness
of data as a product along the data chain:
• Evolution: Data has many uses, and these
uses change over time.
– Example: Mail order companies once wanted your
address simply to ship you a product. Now they
want it to understand customer demographic
patterns.
• Evolution: Once people start using data, they
want to refine data.
– Example: Transition from ICD-9 to ICD-10
represents a refinement of diagnosis codes
• Data Chain: Data that meets its initial quality criteria may
not be of high quality for downstream uses.
– Example: Data may be good enough to enable a claim to
be adjudicated, but not good enough to do outreach to a
member
• Data Chain: Many upstream processes are not aware of
the downstream uses of data.
– A field that is not required for Provider Demographics
may be required to assess quality of care
58. Proprietary
The Semiotic Challenge: Reality and Data
Source: Measuring Data Quality for Ongoing
Improvement. By Laura Sebastian-Coleman
(Morgan Kauffmann, 2013)
59. Proprietary
The Semiotic Challenge: Data and Reality
Source: Measuring Data Quality for Ongoing
Improvement. By Laura Sebastian-Coleman
(Morgan Kauffmann, 2013)
60. Proprietary
Data Quality as a Technical Challenge
Different technical approaches to creating and
using data influence the data itself.
Example: SAS vs. Hadoop rounding
difference
(Desc and acct numbers modified for example)
Problem: When data is extracted
in Hadoop & SAS using the same
query, there is a difference in the
number of records extracted
(7,192 records for Jan-18 period).
Observation: All records have
YTD_Actual less than 50 cents,
absolute value
Hypothesis: Hadoop appears to
round differently than SAS, so
records whose 'YTD_ACTUAL'
values rounded to '0' were
excluded from the query.
| PERIOD_KEY | MAJ_ACCT_DESC | MIN_ACCT_DESC | YTD_ACTUAL |
| --- | --- | --- | --- |
| Jan-18 | Account 1 | Health Management, LLC | 0.01 |
| Jan-18 | Account 2 | MEDICARE | -0.01 |
| Jan-18 | PREPAID EXP | Commissions-HMO Based Product | -0.02 |
| Jan-18 | CURR & DEF'D TAXES | CUR INC TAXES - STATE | 0.29 |
| Jan-18 | MISC LIAB | Settlements | -0.1 |
| Jan-18 | EXP - GEN'L | Telecom Comm Equip:Owned | -0.39 |
| Jan-18 | EXP - GEN'L | Phone - Local/Long Distance | 0.13 |
| Jan-18 | EXP - GEN'L | Phone - Local/Long Distance | 0.33 |
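The rounding hypothesis can be illustrated in miniature. This sketch does not reproduce actual SAS or Hadoop behavior; it simply shows how two rounding modes (half-up vs. banker's half-even) can make a "rounded amount <> 0" style filter keep different record counts over the same cent-level data. The amounts are invented.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Invented amounts; 0.50 is the boundary case where half-up and
# half-even (banker's) rounding disagree.
amounts = [Decimal("0.50"), Decimal("-0.50"), Decimal("0.29")]

def rows_kept(values, rounding):
    """Rows surviving a filter like: WHERE round(amount, 0) <> 0."""
    return [v for v in values if v.quantize(Decimal("1"), rounding=rounding) != 0]

half_up = rows_kept(amounts, ROUND_HALF_UP)      # 0.50 and -0.50 round to +/-1, kept
half_even = rows_kept(amounts, ROUND_HALF_EVEN)  # all three round to 0, dropped
```

Identical source data, two engines, two record counts: exactly the kind of discrepancy that surfaces as a "data quality problem" but is actually a technical-environment difference.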
61. Proprietary
Data Quality: The Knowledge Challenge
The knowledge challenge: In any organization, data is more complicated than a
single person can comprehend. Because data is complicated, it cannot be
managed without metadata (documented knowledge about the data).
The challenge goes beyond knowledge of the data to knowledge of how to manage
data quality. It includes:
• Unexplored assumptions about data and data management – some of which we
covered in the semiotic challenge.
• Lack of consensus about the meaning of key concepts (Data, Data Quality, Data
Quality measurement) – which is why I started with definitions.
• Lack of clear goals and deliverables for the data assessment process.
• Lack of a methodology for defining "requirements", "expectations", and other
criteria for the quality of data at the level of detail needed for measurement.
62. Proprietary
Data Quality as a Political Challenge
The political challenge: Data is knowledge, knowledge is power, power is political
Etymology of Politics:
• Poli = many
• Ticks = blood sucking vermin
Most people dislike politics.
People do not always mean to be political about data.
But data represents business processes, so people are protective.
Their data may be high quality OR it may be low quality data OR they may not know.
Data is about knowledge and people like to be knowledgeable. No one likes to feel
"un-knowledgeable" (i.e., dumb).
64. Proprietary
Definition: Data Quality & Data Quality Management
Data Quality: A measure of the degree to which data is fit for the purposes of the people and systems that use the data.
Data Quality Management: A set of activities intended to ensure that data is fit for purpose, including:
• Data quality assessment
• Data quality requirements definition
• Data quality monitoring
• Data issue detection
• Issue remediation
• Reporting on data quality
• Improving business and technical processes to ensure data is of high quality
All data management processes have the potential to impact the fitness of data for use. But not every data management
process needs to be called a “data quality” process.
This is the activity of running
the engine and looking at the
results (data profiling and
analysis).
Analysis of profiling results
supports these activities
65. Proprietary
Definition: Data Profiling
• Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing.
• Data quality assessment is the process of evaluating data to identify errors and understand their
implications (Maydanchik, 2007).
• Data profiling is a specific kind of data analysis used to discover and characterize important
features of data fields and data sets, including:
– Data types
– Field lengths
– Cardinality of columns
– Granularity
– Existing values
– Format patterns
– Content patterns
– Implied rules
– Cross-column and cross-file
data relationships
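A minimal profiling pass over a single column might look like the following sketch. The sample values are invented; the pattern analysis maps digits to 9 and letters to A, a common profiling convention, so that format patterns like "99999" surface from the raw values.

```python
import re
from collections import Counter

def profile_column(values):
    """Minimal column profile: counts, cardinality, values, format patterns."""
    non_null = [v for v in values if v not in (None, "")]
    # Map digits to 9 and letters to A to characterize format patterns
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
        "format_patterns": patterns.most_common(3),
    }

# Invented sample: a ZIP-code-like field with a short and a malformed value
profile = profile_column(["02139", "02139", "2139", None, "0213A"])
```

Even this toy profile surfaces findings an analyst would document: a null, a dominant "99999" pattern, and two minority patterns worth questioning.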
67. Proprietary
Profiling Goals – Overcoming the Knowledge Challenge in Projects
• Reduce risks related to data development
• Enable initial assessment of source-supplied metadata to reduce the risk of errors related to incorrect identification
of data fields
• Identify risks and obstacles to use of sources (data issues, incorrect assumptions, differences in data granularity,
naming conventions, etc.)
• Accurately identify encryption requirements
• Identify critical data for ongoing data quality measurement, monitoring, and reporting
• Improve project process efficiency
• Improve the quality and consistency of system metadata, beginning with table and column definitions
• Provide input to mapping, including conformance
• Provide input to data modeling
• Provide input to ETL design, including system controls
• Provide input for Quality Assurance and User Acceptance Testing
• Enable Governance over time
• Data quality monitoring
• Improved metadata
68. Proprietary
DART – Data Analysis Results Template
The DART’s worksheets break down into four groups:
• Reference information: Five tabs describe the template and provide guidance on how to observe data characteristics
in profiling results
• Project information: Two tabs bookend the process. One for project goals, the other for summarized findings and
action items.
• Findings and analysis: Four tabs that make up the core of the template and allow analysts to consistently document
what they see. (Note: At this time, the DQ Analysis tab will not be used by projects)
• DQ specification: Captures details for DQ measurements. Input for this will come from the findings and analysis tabs
REFERENCE TABS: Template Purpose and Overview; Guidelines and Usage Notes; DQ Check List; Field Definitions; Downloads
PROJECT ADMIN TABS: Project Details; Sign Offs and Action Items
ANALYSIS AND FINDINGS TABS: Context and Metadata Results; Table Level Results; Column Level Results
SPECIFICATION TAB: DQ Measurement Specification
70. Proprietary
Example Overall Findings
| FINDING CATEGORY | COUNT | PERCENTAGE |
| --- | --- | --- |
| No Data -- 100% defaulted | 85 | 35% |
| Data appears as expected | 64 | 26% |
| Technical field | 21 | 9% |
| Data Differs from Metadata | 19 | 8% |
| Questionable Values | 17 | 7% |
| Should be encrypted, is not | 10 | 4% |
| Sparse Data -- 99% Defaulted | 9 | 4% |
| Questionable Population | 7 | 3% |
| Conformance Risk | 6 | 2% |
| Questionable Population and Values | 4 | 2% |
| TOTAL | 242 | 100% |
71. Proprietary
Making findings actionable
| FINDING CATEGORY | COUNT | PERCENTAGE | ACTION |
| --- | --- | --- | --- |
| Data appears as expected | 64 | 26% | No action |
| Technical field | 21 | 9% | No action |
| No Data -- 100% defaulted | 85 | 35% | Determine impact to project |
| Data Differs from Metadata | 19 | 8% | Clarify with source system; update metadata |
| Questionable Values | 17 | 7% | Clarify with source system; update metadata |
| Sparse Data -- 99% Defaulted | 9 | 4% | Clarify with source system; update metadata |
| Questionable Population | 7 | 3% | Clarify with source system; update metadata |
| Questionable Population and Values | 4 | 2% | Clarify with source system; update metadata |
| Conformance Risk | 6 | 2% | Inform BA's and Modelers |
| Should be encrypted, is not | 10 | 4% | Revise encryption requirements for file |
| TOTAL | 242 | 100% | |
[Bar chart: count of findings by category, grouped by action: no action required; determine if there is impact; clarify with source system; manage within the workstream (inform BA's and Modelers; update requirements)]
73. Proprietary
Some people are optimistic about Big Data
Often, big data is messy,
varies in quality .... What we
lose in accuracy at the micro
level we gain in insight at the
macro level.
Viktor Mayer-Schonberger and Kenneth Cukier, Big
Data: A revolution that will transform how we live,
work, and think.
A data lake's data quality practices are
less about the syntactic quality of the
data (are all the fields perfect?) and more
about the semantic quality of the data
(can we use this well?).
John Myers, "How to answer the top three objections to a data
lake." Info World. September 6, 2016
People who object to data lakes are only
defending the care, feeding, and maintenance of
a data warehouse. The types of 'needs' that this
objection is attempting to address are data
governance, quality, stewardship, and lineage.
John Myers, "How to answer the top three objections to a data lake." Info
World. September 6, 2016
74. Proprietary
And other people are not
We see customers creating big
data graveyards, dumping
everything into HDFS and
hoping to do something with it
down the road. But they just
lose track of what's there.
Sean Martin, Cambridge Semantics
Many companies are guilty of dumping
data into the data lake without a strategy
for keeping track of what's being
ingested. This leads to a murky, swampy
repository .... Unlike relational databases,
Hadoop is little help when it comes to
quality control.
Tony Fisher, Validating Data in the Data Lake. Zaloni Blog. December 15,
2016.
Without at least some
semblance of information
governance, the lake will end
up being a collection of
disconnected data pools or
information silos all in one
place.
Gartner, "Gartner says Beware the Data Lake Fallacy"
Some data lake initiatives have not
succeeded, producing instead more
silos or empty sandboxes.
Brian Stein and Alan Morrison, "The enterprise data lake: Better
integration and deeper analytics." PWC Technology Forecast:
Rethinking Integration. Issue 1, 2014.
75. Proprietary
Big Data – Definition
The DMBOK2 points out:
• The term Big Data is associated with technological changes that have enabled people “to
generate, store, and analyze larger and larger amounts of data.”
• People use this data "to predict and influence behavior, as well as gain insight on a range of
important subjects, such as health care practices, natural resource management, and economic
development." And shopping.
Big Data goes hand-in-hand with data science:
• Changes in technology not only enable collection of huge amounts of data, they also enable
analysis of it.
• Data Science includes the creation of models that enable understanding of possible outcomes if
variables change.
76. Proprietary
Data as the Product of a Process
DQ problems are usually detected in data output, but those problems can be caused at any point
in the production or consumption process.
Data Quality is understood in terms of outputs:
• Expected outputs = Good Quality Data
• Unexpected outputs = Poor Quality Data
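The "expected vs. unexpected outputs" idea can be made concrete as a simple process-control check. This is an illustrative sketch only; the function name, the record-count metric, and the tolerance range are hypothetical examples, not anything specified in the deck.

```python
# Hypothetical sketch: classifying a process output as expected or unexpected.
# The expected range encodes knowledge of the production process (e.g., "we
# normally receive about 1,000 records per day from this feed").

def check_output(record_count, expected_range):
    """Classify one day's output volume against process expectations."""
    low, high = expected_range
    if low <= record_count <= high:
        return "expected"    # expected output -> likely good-quality data
    return "unexpected"      # unexpected output -> investigate the process

expected = (900, 1100)  # illustrative daily volume expectation

print(check_output(1020, expected))  # expected
print(check_output(150, expected))   # unexpected: possible upstream failure
```

The point of the sketch is that the quality signal comes from comparing output to an expectation derived from the production process, not from inspecting the data in isolation.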
77. Proprietary
Big Data Production Process has Different Risks
Still a relationship between inputs, steps, and outputs, but more risk in the process.
Risk can be reduced through knowledge of the original production processes for the data.
78. Proprietary
Life Cycle Management
The same questions
apply to Big Data as
apply to traditional
data.
The same
connections exist
between data quality,
metadata and data
governance.
79. Proprietary
Big Data and the Product Metaphor
Big Data
• Volume
• Variety
• Velocity
• Veracity
Creating more data, of different kinds,
more quickly.
Different types of data have different
degrees of structure depending on
how they are produced.
• Production: Data comes from many places;
very little control over the inputs
• Inventory: Organizations do not know what
data they have, what condition it is in, what
relation it has to the processes that created
it, etc.
• Storage: The ways that we store data have
an impact on its quality
• Usage: We do not know in advance how data will be used, which calls our assumptions
into question. We often fail to recognize the connection between data production and
data uses.
• For Big Data, these problems are intensified.
80. Proprietary
Volume & Velocity: Impact on Veracity
The Big Data characteristics of volume and velocity affect how veracity (truth, and from
there, quality) can even be defined.

Type of Data      | Volume                | Velocity                         | Veracity
Mainframe         | Large but predictable | Fast but predictable             | Measurable
Tabular           | Large but predictable | Fast but predictable             | Measurable
Machine Generated | Potentially huge      | Super fast                       | Depends on calibration
Unstructured      | Potentially huge      | As fast as people can produce it | What would this even mean?
81. Proprietary
Variety
We associate Big Data with new kinds of data, but a lot of traditional data is also being stored in
data lakes.
Big Data is often referred to as "unstructured," but it includes a lot of semi-structured data, as
well as forms of data that are inherently structured by virtue of how they are collected.
Type of Data           | Example          | Inherent Structure
Trad: Mainframe        | EBCDIC files     | High, but messy
Trad: Tabular          | Warehouse tables | High
Big: Machine Generated | Sensor data      | Very high
Big: Unstructured      | Twitter          | Low
82. Proprietary
Logical Relationship between DQ Dimensions
• COMPLETENESS: You have all the pieces of data you need or expect to have.
If you do not have all the data you need, then other measurements of quality may not even matter.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect.
• VALIDITY: Data values belong to the set of possible (expected) values.
If the data values are not in the allowed set of values, then they cannot be correct.
• INTEGRITY: Different pieces of data relate to each other in the ways you expect them to.
If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way
you intended to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
If the data does not follow expected patterns, then you will want to understand why (a change in the pattern, or an
error in the data?)
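The five dimensions above can each be expressed as an executable check. The sketch below is illustrative: the field names, allowed values, and rules are hypothetical examples chosen to show the logical ordering of the dimensions, not rules from the deck.

```python
# Illustrative checks for the five DQ dimensions, in their logical order.
# All field names and rules here are hypothetical examples.
import re

EXPECTED_FIELDS = {"member_id", "visit_date", "state"}
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")  # format conformity rule
VALID_STATES = {"MA", "NH", "VT"}               # validity: allowed value set

def completeness(record):
    """You have all the pieces of data you expect to have."""
    return EXPECTED_FIELDS <= record.keys()

def format_conformity(record):
    """Data is in the form you expect it to be in."""
    return bool(DATE_FORMAT.fullmatch(record.get("visit_date", "")))

def validity(record):
    """Values belong to the set of possible (expected) values."""
    return record.get("state") in VALID_STATES

def integrity(record, known_members):
    """Pieces of data relate to each other as expected (member must exist)."""
    return record.get("member_id") in known_members

def consistency(records, expected_daily_count, tolerance=0.2):
    """Data follows an expected pattern (here, a stable daily volume)."""
    return abs(len(records) - expected_daily_count) <= tolerance * expected_daily_count

rec = {"member_id": "M001", "visit_date": "2023-04-01", "state": "MA"}
print(completeness(rec), format_conformity(rec), validity(rec),
      integrity(rec, {"M001", "M002"}))
```

Note how the checks build on one another: a record that fails completeness cannot meaningfully pass format conformity, and a value that fails format conformity cannot be valid.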
83. Proprietary
Big Data and Dimensions of Quality
Dimensions of quality provide a means to think about how to approach quality for big data.
Type of Data      | Completeness                                | Format Conformity                 | Validity                                   | Integrity                                 | Consistency
Mainframe         | Number of records generated per time period | Constrained by rules              | Constrained by rules                       | Can be systematically constrained         | Expectation based on the process the data represents
Tabular           | Number of records generated per time period | Constrained by rules              | Constrained by rules                       | Can be systematically constrained         | Expectation based on the process the data represents
Machine Generated | Rate at which data is collected             | Constrained by collection device  | Based on calibration of collection device  | Depends on consistent collection devices  | Expectation based on the process the data represents
Unstructured      | ??                                          | Not relevant                      | Not applicable                             | ??                                        | No expectation of consistency
84. Proprietary
Data Quality Challenges – Intensified by Big Data
• The semiotic challenge: People have different ways of representing the “same” concepts
– Traditional: GOVERNANCE
– Big: Governance, but at the category and metadata level
• The technical challenge: Different technical approaches to creating and using data influence the data itself.
– Traditional: DATA STANDARDS
– Big: Manage the ingest process, esp. manage metadata up front
• The knowledge challenge: Because data is complicated, a single individual cannot know all the data.
– Traditional: METADATA
– Big: Metadata is even more important
• The political challenge: Data is knowledge, knowledge is power, power is political
– Traditional: GOVERNANCE / CULTURE
– Big: Governance/Culture
86. Proprietary
Meeting the Challenges for Big Data
METADATA – Addressing the knowledge challenge
• Production and Lineage: Data comes from many places;
very little control over the inputs. Need to know where data
comes from
• Inventory: Inventorying how much data an organization has
/ what data it has
• Storage: Need to understand how ingest and storage
process impacts data
• Usage: We will never know all the potential uses of data.
Ensure consumers know what the data represents, how it
was produced, how it is stored
• Metadata management: Enabling data usage by managing
knowledge of data; set minimum requirements for metadata
related to big data. The priorities change: it is not possible to
define every field in the way you would with traditional data.
GOVERNANCE – Addressing data, process,
and cultural risks
• Accountability: Defining data ownership and
accountability
• Protection: Protecting against the misuse of
data
• Risk mitigation: Managing risks associated
with data
• Standards: Defining and enforcing standards
for data quality
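One way the metadata and governance points above come together in practice is enforcing minimum metadata requirements at ingest, so the lake stays an inventory rather than a graveyard. This is a hypothetical sketch: the required fields and names below are illustrative; each organization would set its own minimum bar.

```python
# Hypothetical sketch: enforcing minimum metadata at data lake ingest.
# The required fields are illustrative examples, not a prescribed standard.

REQUIRED_METADATA = {"source_system", "ingest_date", "owner", "description"}

def register_dataset(name, metadata, catalog):
    """Refuse to ingest a dataset whose metadata is below the minimum bar."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"Cannot ingest {name}: missing metadata {sorted(missing)}")
    catalog[name] = metadata  # inventory: we always know what we have

catalog = {}
register_dataset(
    "claims_2023",
    {"source_system": "mainframe_claims", "ingest_date": "2023-04-01",
     "owner": "claims_team", "description": "Daily claim extracts"},
    catalog,
)
print(sorted(catalog))
```

The design choice here mirrors the slide: rather than defining every field up front (as with traditional data), governance sets a minimum metadata standard and the ingest process enforces it, which addresses production, lineage, and inventory in one step.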
87. Proprietary
Summary
Product Management practices do work for traditional
data.
They also work for Big Data, but with modifications
based on the production processes of Big Data.
Managing the quality of both Big Data and traditional
(little) data is dependent on managing metadata.
The process of figuring out how to manage your data
will significantly inform what you need to do to govern
your data via standards and monitoring.
88. Proprietary
Meeting the Challenges with Big and Little Data:
Characteristics of a Trusted Source of Data
1. SECURE: Data is protected against inappropriate access or use through policies, processes, and tools.
2. RELIABLE: Data processing is predictable and reliable. The system is monitored for performance. Controls are in
place to detect and respond to unexpected events.
3. DATA QUALITY IS KNOWN: The criteria for high quality data are defined. Levels of quality are measured and
reported on. Data issues are communicated to data consumers and remediated based on business priorities.
4. TRANSPARENT AND COMPREHENSIBLE: Data consumers have the information (Metadata) they need to
understand and get value from the data. Knowledge about the system and its data is documented, accessible,
usable, and current.
5. SUPPORTED: A dedicated production support team is in place and has the processes and protocols it needs to
respond in a timely manner to questions and issues related to the operation of the system and the data in the
system.
6. COMMUNICATED: New data consumers have access to relevant training; existing data consumers are informed of
changes that impact their uses of the data.
7. GOVERNED: Processes and accountabilities are in place to make decisions about the data in the system.
89. Proprietary
Goals from Agenda
Agenda
• Introductions
• Quality management concepts and principles
• Applying quality management to traditional data
• Big Data challenges
• Data Quality Practices for Big Data and Little Data