Proprietary
©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary 1
#damaweek2019
Data Quality
Fundamentals
Laura Sebastian-Coleman, Ph.D., CDMP
Data Quality Lead
Shared Services Enterprise Data Governance, CVS Health
DAMA Days – DAMA Mexico
November 2019
Abstract: DQ Fundamentals
• Organizations today get value from their data in the face of challenging odds.
• Optimal management of traditional data requires a wide skillset and a strategic perspective.
• Changes in technology have increased the volume, velocity, and variety of data, but many
organizations do not yet have a handle on veracity in traditional data management
environments, never mind big data environments.
• And, while big data is on the rise, more traditional forms of data are not going away. Instead,
different kinds of data will co-exist and must be managed in conjunction with one another.
• This tutorial will revisit the fundamentals of data quality management in the light of big data and
explore how to apply them in traditional and big data environments.
• Participants will learn how to assess the current state of their data environment and deliver
more reliable data to their stakeholders.
About me
Data quality practitioner in the health care industry since 2003
Background in banking, manufacturing / distribution, commercial insurance, and academia
Publications
– Author, Navigating the Labyrinth: An Executive Guide to Data Management (2019)
– Production Editor, DAMA Data Management Body Of Knowledge second edition, [DMBOK2] (2017)
– Author, Measuring Data Quality for Ongoing Improvement (2013)
Service
– Advisor, DAMA New England, 2019 - present
– DAMA Publications Officer, 2015 – 2019
– IAIDQ (now IQ International) Member Director, 2010-12
Recognition
– DAMA International Recognition for Outstanding Contributions to Data Management, 2019
– DAMA New England Award for Excellence in Data Management, 2019
– IAIDQ (now IQ International) Distinguished Member Award, 2015
About you
Agenda
• Introductions
• Quality management concepts and
principles
• Applying quality management to traditional
data
• The role of measurement and monitoring
• Big Data challenges
• Data Quality Practices for Big Data and
Little Data
Why Data Quality matters: Because data is valuable
Why DQ Management Matters: Poor quality data costs money
• Reports differ, but many estimate that 10-30% of productivity is lost due to poor quality data.
• This may even be low: one report indicated that data scientists spend 60% of their time cleansing data.
• IBM estimated that data quality problems cost the US $3 trillion in 2016.
[Chart: 70% productive vs. 30% unproductive time; 10-30% of productivity is lost due to poor quality data]
[Chart: Data scientists' time: 60% cleansing data vs. 40% analyzing data]
Quality Management Concepts
A short history of an important idea
Definition of Quality: Fitness for Purpose / Fitness for Use
Data Quality: A measure of the degree to which data is fit for
the purposes of the people, processes, and systems that use
the data.
The concept of “fit for purpose” directly relates data quality to
the quality of manufactured products.
Data is a product, NOT a by-product.
"Fit for Purpose" also relates data quality to the concept of a data consumer: a person, process, or system that uses data.
Data Quality Management: A set of activities intended to ensure that data is fit for use by data consumers.
Manufacturing: A brief history of mass-produced products
19th Century Industrial Revolution:
• Steam power
• Interchangeable parts
• Development of large factories
• Production line manufacturing processes
• Machine tooling
20th Century Mass Production:
• Machine tooled interchangeable parts
• Assembly line
• Vertical integration of the manufacturing process
• Quality control
Power, Process, Technology, and Standardization enabled Vertical
Integration of Manufacturing
Pioneers of Quality Control
• Defined criteria for quality based on customer
expectations
• Recognized the relation between a well-defined
process and a predictable outcome
• Used measurement to manage and improve
processes
• Created tools to assess and improve product
quality
• Recognized that producing a quality product
requires life cycle management, supply chain
management, and leadership commitment
Quality Control in Manufacturing: Product and Process
A process is a series of steps that turn inputs into outputs.
• The better quality the inputs
• The better defined the steps
• The better quality the outputs
Add to this the idea that the execution of processes can be improved through observation, analysis, and
feedback.
The more consistent the input and the more consistently the process is executed, the more consistent the
result.
Quality and the Customer
Thought leaders in Quality Control / Quality improvement recognize that there is a customer at the end of the
assembly line: Someone wants to buy the product.
That person has expectations at two levels:
• At the very least, the Product must perform its primary function.
• Ideally, the Product also pleases the customer in some way.
Dimensions of Product Quality (from David Garvin)
– Performance: The product operates as expected.
– Features: The product has additional characteristics that please the customer.
– Reliability: The product works well. The customer can count on it.
– Conformance: The product meets standards.
– Durability: The product lasts for an expected amount of time.
– Serviceability: If the product breaks it can be fixed.
– Aesthetics: The product is attractive and pleasing.
– Perceived Quality: The customer feels good about the product.
Intention and quality: Quality is not accidental
Source: Kaizen institute of India.
https://kaizeninstituteindia.wordpress.com/2013/10/08/quality-is-
not-an-act-it-is-a-habit/
Life Cycle Management
Life Cycle management is an extension of the idea of quality
control to all aspects of creating a product.
The Role of Measurement in Quality Control
Statistical process control – a means to
measure the consistency of processes
Measurement formalizes expectations
Monitoring ensures unexpected
variation within the system is detected
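The statistical process control idea can be sketched in a few lines. This is an illustrative Python sketch, not from the deck: the function names, the sample data, and the 3-sigma rule are assumptions.

```python
from statistics import mean, stdev

def control_limits(samples, sigmas=3):
    """Control limits from historical samples: mean +/- sigmas * std dev."""
    mu, sd = mean(samples), stdev(samples)
    return mu - sigmas * sd, mu + sigmas * sd

def out_of_control(samples, new_value, sigmas=3):
    """Flag a new observation that falls outside the control limits."""
    lo, hi = control_limits(samples, sigmas)
    return not (lo <= new_value <= hi)

# Daily completeness percentages from a stable period (invented data)
history = [97.1, 96.8, 97.3, 97.0, 96.9, 97.2, 97.1, 97.0]
print(out_of_control(history, 92.5))  # True: unexpected variation detected
print(out_of_control(history, 97.0))  # False: within normal variation
```

This is the monitoring idea in miniature: the historical samples formalize the expectation, and the limits detect unexpected variation.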
Applying Quality Management
Concepts to Data
Produce data like we produce other
products
Definition: Data Quality Management
Data Quality: A measure of the degree to which data is fit for the purposes of the people and
systems that use the data.
Data Quality Management: A set of activities intended to ensure that data is fit for purpose,
including:
• Data quality assessment
• Data quality requirements definition
• Data quality monitoring
• Data issue detection
• Issue remediation
• Reporting on data quality
• Improving business and technical process to ensure data is of high quality
• What you mean by high quality data
• How you detect low quality data
• What you do about low quality data
All data management processes have the potential to impact the fitness of data for use, but not every process needs to be called a "data quality" process.
Core Data Quality processes have foundational, project-oriented, and operational components.
Stuff Connected with DQ Management but Not Exactly DQ Management
Quality Assurance
• Kinda DQ: QA focuses on quality.
• Not Quite DQ: Focus on functionality, may or may not
include data. Project process, rather than ongoing
process.
System Controls (manage data movement)
• Kinda DQ: System Controls help confirm data
completeness – they show that you have not lost
data.
• Not Quite DQ: You need them to run the system,
regardless of the quality of data.
Architecture / System Design to enforce quality
• Kinda DQ: System design can directly impact the
quality of data
• Not Quite DQ: System design encompasses many
other things that are not data
Metadata Management
• Kinda DQ: You cannot understand data without
metadata
• Not Quite DQ: Metadata is a form of data. It requires
the same kind of DQ management that other forms
of data require.
Data Stewardship
• Kinda DQ: Stewards know a lot about data and
much of what they do helps us understand data
quality
• Not Quite DQ: Stewardship is wider than quality and
may not even focus on quality.
Data Cleansing
• Kinda DQ: It makes the data better. Isn’t that the
point?
• Not Quite DQ: Data cleansing is a solution to some
data quality issues. It is not a goal of DQ
Management.
Data as the Product of a Process
Process: A process is a series of steps that turn inputs into outputs
Data as the Product of a Process
DQ problems are usually detected in data output, but those problems can be caused at any point in the production or consumption process.
Data Quality is understood in terms of outputs
• Expected outputs = Good Quality Data
• Unexpected outputs = Poor Quality Data
Complexity increases risks associated with data
Risk multiplies as data moves along the data chain from system-to-system, use-to-use.
Intention: Data Quality Improvement via PDCA
The same processes that are applied to improve the
quality of manufactured products can be applied to
improve the quality of data.
Different improvement methodologies use essentially
the same process.
• Six Sigma
• Lean
• Total Quality Management
Simplified Improvement Cycle
[Diagram: Simplified improvement cycle]
Define Quality (Expectations, Standards, Rules) → Assess Data Against Expectations, Standards, Rules → Define Measurement / Monitoring Requirements → Monitor Data Quality → Report on Data Quality Results → Manage Data Issues → Identify and Act on Improvement Opportunities → revise / improve the definitions of quality, closing the loop
[Diagram: The improvement cycle, annotated with what each step requires]
Define Quality (Expectations, Standards, Rules) requires:
– Standards for rules
– Criteria for criticality
– SME / Data Consumer input
– Working set of CDEs
– Feedback process
– Maintenance process
Assess Data Against Expectations, Standards, Rules requires:
– Access to data
– Profiling engine and process
– Evaluation methodology
– SME / Data Consumer input
Define Measurement / Monitoring Requirements requires:
– Standards for rules
– Analysis of historical data
– Specification template
– SME / Data Consumer input
– Staff to implement and maintain
Monitor Data Quality requires:
– Guidelines and goals for monitoring
– Process automation / tooling
– Staff to review and respond
– Response protocols
– Access to system and business SMEs
Report on Data Quality Results requires:
– Goals based on SME / Data Consumer input
– Reporting standards / templates
– Reporting tool
– Schedule
Manage Data Issues requires:
– Process flow
– Issue definition template
– Prioritization criteria
– Escalation path
– Tracking tool
– SME / Data Consumer input
– Access to decision makers
Identify and Act on Improvement Opportunities requires:
– Knowledge of business goals
– Knowledge of data issues
– SME / Data Consumer input
– Root cause analysis skills
– Proposal process
– Funding process
Data Quality
Improvement Cycle
Data Life Cycle Management
Adapted from Danette McGilvray,
Executing Data Quality Projects: Ten
Steps to Quality Data and Trusted
Information
Manage Data Quality throughout the Data Life Cycle
Managing quality
throughout the data life
cycle requires
• Data Governance
• Metadata Management
Adapted from Danette McGilvray,
Executing Data Quality Projects: Ten
Steps to Quality Data and Trusted
Information
The Role of Measurement and
Monitoring
You cannot manage what you cannot
measure
Dimensions of Data Quality – Why they matter
• Data quality dimensions function in the way that length,
width, and height function to express the size of a physical
object.
• They allow understanding of quality in relation to a scale
and in relation to other data measured against the same
scale.
• Data quality dimensions can be used to define
expectations (the standards against which to measure) for
the quality of a desired dataset, as well as to measure the
condition of an existing dataset.
• Dimensions provide an understanding of why we measure
(what question a measurement answers). For example, to
understand the level of completeness, validity, and integrity of
data.
• Dimensions also help us identify things that we cannot
measure or that there is little value in measuring.
Data Quality / Quality of Data
Data Quality / Quality of Data: A measure of the degree
to which data is fit for the purposes of the people and
systems that use the data.
What contributes to data’s “fitness for purpose”?
• Representational Effectiveness: How well and
consistently data represents the concepts it stands for.
• Data Knowledge: How well data consumers understand
and can de-code the data.
• Dimensions of Quality: How well data conforms to
expectations expressed via measurable Characteristics
of Quality
Dimensions of Data Quality
A Dimension of Data Quality is a characteristic of data that can be measured and through which its quality can be quantified.
There are many frameworks that define DQ dimensions. There is no single agreed-to set. However, all account for similar concepts, which have a common-sense meaning.
• COMPLETENESS: You have all the pieces of data you need or expect to
have.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
• VALIDITY: Data values belong to the set of possible (expected) values.
• INTEGRITY: Different pieces of data relate to each other in the ways you
expect them to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
Data Quality Issue / Data Quality Improvement
Data Quality Issue: A data quality issue is
any condition of the data that is an obstacle
to a data consumer’s use of the data,
regardless of the root cause of the obstacle.
• Issues can be caused by actual errors – a
person or a process made a mistake.
• Or by any of the challenges inherent in data
• People misunderstand / misinterpret
• Technology does unexpected things
• People disagree
Data Quality Improvement: A Data Quality
improvement is a measurable, positive
change in data that makes it more fit for use.
In other words: any change that reduces or
removes an obstacle.
Logical Relationship between DQ Dimensions
• COMPLETENESS: You have all the pieces of data you need or expect to have.
If you do not have all the data you need, then other measurements of quality may not even matter.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect.
• VALIDITY: Data values belong to the set of possible (expected) values.
If the data values are not in the allowed set of values, then they cannot be correct.
• INTEGRITY: Different pieces of data relate to each other in the ways you expect them to.
If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way
you intended to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
If the data does not follow expected patterns, then you will want to understand why (Change in the pattern or error
in the data?)
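This logical ordering lends itself to short-circuit evaluation: stop at the first failed dimension, since the later checks are moot once an earlier one fails. A minimal sketch in Python; the state-code field, its checks, and its toy value domain are invented for illustration.

```python
import re

def evaluate(value, checks):
    """Apply dimension checks in logical order and report the first
    failed dimension; later dimensions are moot once an earlier one fails."""
    for dimension, check in checks:
        if not check(value):
            return dimension
    return None  # all checks passed

# Invented example: a two-letter US state code field
state_checks = [
    ("completeness", lambda v: v not in (None, "")),
    ("format conformity", lambda v: bool(re.fullmatch(r"[A-Z]{2}", v))),
    ("validity", lambda v: v in {"CT", "NY", "MA"}),  # toy value domain
]

print(evaluate("", state_checks))    # completeness
print(evaluate("Ct", state_checks))  # format conformity
print(evaluate("TX", state_checks))  # validity
print(evaluate("CT", state_checks))  # None
```

The returned dimension name tells the analyst which expectation broke first, which is exactly the diagnostic value of ordering the dimensions.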
Using DQ Dimensions to Create Standards & Rules
Standard: A level of quality or attainment (a high standard for customer
service); an idea or thing used as a measure, norm, or model in comparative
evaluations (e.g., ISO standards)
Rule: one of a set of explicit or understood regulations or principles
governing conduct within a specific activity or sphere (e.g., Robert's Rules of
Order); a principle that operates within a particular sphere of knowledge,
describing or prescribing what is possible or allowable.
Business Rule: A business rule is a rule that defines or constrains some aspect
of business and always resolves to either true or false. Business rules are
intended to assert business structure or to control or influence the behavior of
the business.
STANDARDS represent an intersection of Data Quality Management and Data
Governance. Standards are also a form of metadata. DQ uses them to
measure, but they are also simply useful to explain expectations.
Benefits of Rules
• Common vocabulary – People understand expectations
in a similar way
• Consensus – People agree to the same things.
• Differences – People can also disagree about rules.
Rules provide a means of surfacing and therefore
clarifying different expectations.
• Simplicity – People make decisions once
• Predictability – People know what is expected of them
and they try to achieve it
• And with data… they give us a way to talk more
objectively about quality
Using DQ Dimensions to Create Standards & Rules
Dimensions of quality are the foundation of a common
vocabulary through which to articulate expectations
for quality. They can be used to:
• Create standards and rules & controls for models
and applications
• Establish measurements
• Report problems consistently
For example, completeness can be understood at
several levels: system, data set, record, field.
Dimension    | Data Object                            | Rule
Completeness | Data Set                               | The number of [distinct entity] must be equal to the number of [distinct entity] in [Source]
Completeness | Data Set                               | [Amount field] must reconcile to [Amount field] in [Source]
Completeness | Field                                  | Must be populated
Completeness | Field                                  | Must be populated; standard default value allowed
Completeness | Optional fields, with population rules | Must be populated when …
Completeness | Optional fields, with population rules | Must be populated except when …; Must NOT be populated when …
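Rules like these translate directly into code. A hedged sketch of field-level completeness evaluation: the field names, the default-value set, and the conditional-population predicate are invented for illustration, not taken from the deck.

```python
DEFAULTS = {"UNK", "N/A"}  # standard default values (illustrative)

def completeness(record, field, default_allowed=False, required_when=None):
    """Field-level completeness for one record. required_when is an
    optional predicate that makes the field conditionally mandatory."""
    if required_when is not None and not required_when(record):
        return True  # the population rule does not apply to this record
    value = record.get(field)
    if value in (None, ""):
        return False  # "Must be populated" fails
    if value in DEFAULTS:
        return default_allowed  # "standard default value allowed"?
    return True

member = {"member_id": "123", "relationship": "SPOUSE", "marital_status": ""}
print(completeness(member, "member_id"))  # True
print(completeness(member, "marital_status",
                   required_when=lambda r: r["relationship"] == "SPOUSE"))  # False
print(completeness({"plan": "UNK"}, "plan", default_allowed=True))  # True
```

The same function shape covers four of the table's rule patterns: mandatory, mandatory-with-default, and the two conditional "populated when" variants.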
Logic for Field Level Completeness
Dimensions enable you to
consistently describe the
characteristics you are looking
for.
Many rules can be defined
through a logical progression of
questions related to the
dimension.
Here is an example focused on
completeness at the field level.
Similar questions could be asked
at the system or file level.
Logic for Format Conformity
As with completeness, with conformity,
we can establish a logical progression
of questions to define expectations at
the field level.
Some fields have one-and-only-one
acceptable format.
More complex fields may have a set of
format requirements.
Others are constrained only by data
type and format may not indicate much
about the quality of data.
Logic for Validity
The word validity is often used in a general sense to mean that the data is "good".
As a dimension of quality, it refers to
whether values are part of a defined
domain.
Validity Rules can be based on how the
domain of values is defined.
Sample Rule Syntax – Validity for Codified Data
Working through the decision tree
results in a standard syntax for
expressing rules.
Rules can be used to:
• Clarify expectations about quality
• Measure data quality
• Report on data quality
Depending on how much we know
about the data, they can also be used to
transform data. For example, to
populate a consistent default value for
all invalid values.
Dimension | Data Subset   | Rule                                                   | Meaning
Validity  | Codified data | Valid values are limited to: [List of valid values]    | Specifies a list of values that are valid. All other values are invalid.
Validity  | Codified data | Values must exist in [code table / column …]           | Specifies the code table and the column in the code table in which valid values are stored. All other values are invalid.
Validity  | Codified data | The range of valid values is between: [MIN] and [MAX]  | Provides the MIN and MAX value for the range. Any values outside of the MIN/MAX are invalid.
Validity  | Codified data | Invalid values include: [List of invalid values]       | Specifies a list of values that are not valid. All other values are valid.
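Each rule pattern in the table maps to a simple predicate. An illustrative Python sketch; the code set shown is invented, and in practice the valid values would be read from the code table the rule names rather than hard-coded.

```python
def valid_in_list(value, valid_values):
    """Valid values are limited to: [List of valid values]."""
    return value in valid_values

def valid_in_range(value, lo, hi):
    """The range of valid values is between: [MIN] and [MAX]."""
    return lo <= value <= hi

def invalid_in_list(value, invalid_values):
    """Invalid values include: [List of invalid values]; all others are valid."""
    return value not in invalid_values

# In practice the valid set would come from a code table, not a literal
GENDER_CODES = {"M", "F", "U"}
print(valid_in_list("F", GENDER_CODES))  # True
print(valid_in_range(150, 0, 120))       # False
print(invalid_in_list("?", {"", "?"}))   # False
```

Because the predicates share a shape, the same rule engine can measure validity, report it, or (where enough is known about the data) substitute a consistent default for every invalid value.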
Using DQ Dimensions to Create Good Measurements
Characteristics of Good Measurements
• Meaningful: They are focused on characteristics that are important.
» Think: taking a child’s temperature
• Comprehensible: They present information that people can understand.
» Think: understanding how to read a thermometer
• Actionable: They allow people to make a decision or take an action.
» Think: knowing what to do when the temperature is higher than normal
Dimensions of quality help with all of these
• Meaningful: They are focused on characteristics that are important
– They define what = GOOD Quality and what = BAD Quality
• Comprehensible: They present information that people can understand.
– They allow people to understand what is RIGHT or WRONG with the data – it is incomplete, invalid, etc.
• Actionable: They allow people to make a decision or take an action.
– They allow people to decide whether or not the data they want to use is FIT FOR PURPOSE
Sample Data Quality Standards
Data Element    | Completeness                              | Conformity                                                                        | Validity                                                                               | Integrity
Date of Birth   | Must be populated                         | Must conform to the date format requirements of the system in which it is present | Cannot be a future date; person cannot be older than 120 years (based on current date) | All occurrences of records for an individual should have the same date of birth
Birth Gender    | Must be populated                         | See Validity                                                                      | Values must exist in [list OR code table / column …]                                   | All occurrences of records for an individual should have the same birth gender
Gender Identity | Optional -- no known rules for population | See Validity                                                                      | Values must exist in [list OR code table / column …]                                   | All occurrences of records for an individual within a time frame should have the same gender identity
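The Date of Birth standards above can be checked mechanically. A sketch: the 120-year bound and the no-future-date rule come from the table; the function name and sample dates are illustrative.

```python
from datetime import date

def dob_is_valid(dob, today=None):
    """Date of Birth validity: cannot be a future date, and the person
    cannot be older than 120 years (based on the current date)."""
    today = today or date.today()
    if dob > today:
        return False
    oldest_allowed = date(today.year - 120, today.month, today.day)
    return dob >= oldest_allowed

as_of = date(2019, 11, 1)
print(dob_is_valid(date(1985, 6, 15), as_of))  # True
print(dob_is_valid(date(2020, 1, 1), as_of))   # False: future date
print(dob_is_valid(date(1890, 1, 1), as_of))   # False: older than 120 years
```

Passing the as-of date explicitly keeps the check reproducible, which matters when the measurement is re-run against historical snapshots.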
Sample Data Quality Standards
Data Element               | Completeness                                        | Conformity   | Validity                                             | Integrity
Ethnicity                  | Optional -- no known rules for population           | See Validity | Values must exist in [list OR code table / column …] | Optional field - no integrity rules
Race                       | Optional -- no known rules for population           | See Validity | Values must exist in [list OR code table / column …] | Optional field - no integrity rules
Marital Status             | Situational -- required for some business processes | See Validity | Values must exist in [list OR code table / column …] | Situational
Relationship to subscriber | Must be populated                                   | See Validity | Values must exist in [list OR code table / column …] | No rules identified; value can change over time.
Applying DQ Dimensions – Lessons Learned
• The dimensions provided a new perspective on the data.
• Seeing the data separate from and in relation to systems
– What is optional in a system may be mandatory to a downstream process
• Translating common sense expectations about the data into consistent, objective criteria for reasonability
– Every person has a birth date and we can define a reasonable range for birth dates within a database
• Seeing gaps in expectations
– Marital status -- should everyone have a marital status, even a child? Or should only subscribers have this?
• Seeing that some concepts are not well-defined and may never be well-defined
– Race, ethnicity
• Some concepts that we once considered well defined are evolving
– Gender identity vs. birth gender
Despite the flux, we were able to come to consensus on our expectations – the dimensions provided a vocabulary to do so.
They allowed us to clarify expectations in a consistent manner.
Example Measurement
This measures the level of
completeness [MUST BE
POPULATED] of a critical field
It is very simple, because the
concept itself is very simple.
In many cases, you don’t need
more than this.
[Chart: Daily trend of field completeness, 27/12/2018 through 07/02/2019]
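A measurement like this is just a populated-over-total ratio computed per load. A minimal sketch with invented sample data:

```python
def completeness_ratio(values):
    """Fraction of values that satisfy MUST BE POPULATED."""
    if not values:
        return 1.0  # vacuously complete: nothing was loaded
    populated = sum(1 for v in values if v not in (None, ""))
    return populated / len(values)

# Invented daily loads of one critical field
daily_loads = {
    "2019-01-04": ["A12", "B34", "", "C56"],
    "2019-01-05": ["D78", "E90", "F12", "G34"],
}
trend = {day: completeness_ratio(vals) for day, vals in daily_loads.items()}
print(trend)  # {'2019-01-04': 0.75, '2019-01-05': 1.0}
```

Plotting the per-day ratios yields exactly the kind of trend line shown on the slide; the value of the chart is in the stability of the line, not any single point.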
Example of Rolled Up Score Card with visual
MAY 2019 Summary Scorecard: Dimensions of Quality

Line of Business | Source System | Completeness | Conformity | Validity | Overall Score | Upper Threshold | Lower Threshold | Status
ABC              | ABC1          | 97%          | 93%        | 100%     | 96.5%         | 98.5%           | 95.0%           | Amber
ABC              | ABC1          | 97%          | 94%        | 100%     | 96.7%         | 98.5%           | 95.0%           | Amber
DEF              | DEF1          | 96%          | 93%        | 100%     | 96.1%         | 98.5%           | 95.0%           | Amber
DEF              | DEF2          | 97%          | 91%        | 100%     | 96.1%         | 98.5%           | 95.0%           | Amber
GHI              | GHI1          | 99%          | 98%        | 100%     | 98.7%         | 98.5%           | 95.0%           | Green
JKL              | JKL1          | 96%          | 98%        | 100%     | 98.0%         | 98.5%           | 95.0%           | Amber
MNO              | MNO1          | 97%          | 93%        | 100%     | 96.7%         | 98.5%           | 95.0%           | Amber
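The roll-up and the traffic-light status can be sketched as below. A plain unweighted mean is assumed here; the actual scorecard may weight the dimensions differently, which is why this sketch does not claim to reproduce the slide's overall scores exactly.

```python
def overall_score(dimension_scores):
    """Unweighted mean of dimension scores (illustrative; a production
    scorecard may weight completeness, conformity, and validity differently)."""
    return sum(dimension_scores) / len(dimension_scores)

def status(score, upper=98.5, lower=95.0):
    """Traffic-light status using the slide's thresholds."""
    if score >= upper:
        return "Green"
    if score >= lower:
        return "Amber"
    return "Red"

print(status(overall_score([99, 98, 100])))  # Green
print(status(overall_score([97, 93, 100])))  # Amber
print(status(80.0))                          # Red
```

Keeping the thresholds as named parameters makes them easy to govern centrally rather than hard-coding them into each report.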
[Chart: Overall Data Quality by source system (ABC1, ABC2, DEF1, DEF2, GHI1, JKL1, MNO1), Nov-18 through May-19; y-axis 90.0% to 100.0%]
Limitations of “Data as a Product”
OR Why Data is Different
Limitations of the Product Metaphor
The Challenges
• Data is not a physical product.
• Data is not tangible, but it is durable
• Data is easy to copy but very hard to reproduce from
scratch
• The same data can be used by multiple people and
processes at the same time
• Data is volatile
• The value of data changes based on context and
timing
• Using data often results in new data
Adapted from DMBOK2, chapter 1, which is adapted from Redman, Data Driven.
The Risks
An organization does not know what data it has
Data can be lost, breached, or misused
Data is replicated and variation is created between data
sets
The quality of data deteriorates over time or across
functions
Knowledge of data deteriorates within the organization
NOT JUST FITNESS FOR PURPOSE
Representational effectiveness: How well and consistently data represents the concepts it is intended to represent
Data Knowledge: How easily and well data
consumers can “decode” data
Power, Process, Technology, and Standardization enabled Vertical
Integration of Manufacturing
We don’t treat data as a product
• Production: Data comes from many places; very little control over the inputs
• Inventory: Organizations do not know what data they have, what condition it is in, what relation it has to the
processes that created it, etc.
• Storage: The ways that we store data have an impact on its quality, but we do not always account for this when we
work with data.
• Usage: We do not know how data will be used, and we do not recognize the connection between data production and data uses.
What would happen if we
treated physical products
the way that we treat data?
What is the Product? Who is the customer?
The challenges with the product concept are
related to how data evolves within the data
lifecycle and to different levels of awareness
of data as a product along the data chain:
• Evolution: Data has many uses, and these
uses change over time.
– Example: Mail order companies once wanted your
address simply to ship you a product. Now they
want it to understand customer demographic
patterns.
• Evolution: Once people start using data, they
want to refine data.
– Example: Transition from ICD-9 to ICD-10
represents a refinement of diagnosis codes
• Data Chain: Data that meets its initial quality criteria may
not be of high quality for downstream uses.
– Example: Data may be good enough to enable a claim to
be adjudicated, but not good enough to do outreach to a
member
• Data Chain: Many upstream processes are not aware of
the downstream uses of data.
– A field that is not required for Provider Demographics
may be required to assess quality of care
DQ Challenges stem from the nature of data
• The semiotic challenge: People have
different ways of representing the “same”
concepts. Data is disparate.
• The knowledge challenge: Because data is
complicated, a single individual cannot
know all the data.
• The technical challenge: Different technical
approaches to creating and using data
influence the data itself and impact its
quality.
• The political challenge: Data is knowledge,
knowledge is power, power is political.
The simple answer: Define, Measure, Monitor
• Define: To reduce ambiguity
• Measure: To confirm the actual state of the
data
• Monitor: Detect change over time
More complicated answer: These are hard to do
The Semiotic Challenge: Reality and Data
Source: Measuring Data Quality for Ongoing
Improvement. By Laura Sebastian-Coleman
(Morgan Kaufmann, 2013)
The Semiotic Challenge: Data and Reality
Source: Measuring Data Quality for Ongoing
Improvement. By Laura Sebastian-Coleman
(Morgan Kaufmann, 2013)
Data Quality as a Technical Challenge
Different technical approaches to creating and
using data influence the data itself.
Example: SAS vs. Hadoop rounding
difference
(Desc and acct numbers modified for example)
Problem: When data is extracted
in Hadoop & SAS using the same
query, there is a difference in the
number of records extracted
(7,192 records for Jan-18 period).
Observation: All records have
YTD_Actual less than 50 cents,
absolute value
Hypothesis: It looks like Hadoop rounds differently than SAS; records whose values rounded to '0' in the 'YTD_ACTUAL' field were excluded from the query.
PERIOD_KEY | MAJ_ACCT_DESC      | MIN_ACCT_DESC                 | YTD_ACTUAL
Jan-18     | Account 1          | Health Management, LLC        | 0.01
Jan-18     | Account 2          | MEDICARE                      | -0.01
Jan-18     | PREPAID EXP        | Commissions-HMO Based Product | -0.02
Jan-18     | CURR & DEF'D TAXES | CUR INC TAXES - STATE         | 0.29
Jan-18     | MISC LIAB          | Settlements                   | -0.1
Jan-18     | EXP - GEN'L        | Telecom Comm Equip:Owned      | -0.39
Jan-18     | EXP - GEN'L        | Phone - Local/Long Distance   | 0.13
Jan-18     | EXP - GEN'L        | Phone - Local/Long Distance   | 0.33
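The effect can be illustrated in pure Python: two engines that break rounding ties differently will disagree, and a filter applied to a rounded amount silently drops every sub-50-cent record. This sketch uses the slide's values but does not claim to reproduce SAS or Hadoop internals.

```python
from decimal import Decimal, ROUND_HALF_UP

# YTD_ACTUAL values from the slide: all under 50 cents in absolute value
values = [0.01, -0.01, -0.02, 0.29, -0.1, -0.39, 0.13, 0.33]

# Two engines can break rounding ties differently:
print(round(0.5))  # 0 -- Python rounds half to even
print(int(Decimal("0.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP)))  # 1

# An engine that rounds amounts to whole units before filtering
# (e.g. keeping only rows whose rounded amount is nonzero)
# silently drops every one of these records:
survivors = [v for v in values if round(v) != 0]
print(survivors)  # []
```

This is the technical challenge in miniature: neither engine is "wrong", but unless the rounding behavior is known and documented, the same query yields different record counts.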
Data Quality: The Knowledge Challenge
The knowledge challenge: In any organization, data is more complicated than a
single person can comprehend. Because data is complicated, it cannot be
managed without metadata (documented knowledge about the data).
The challenge goes beyond knowledge of the data to knowledge of how to manage
data quality. It includes:
• Unexplored assumptions about data and data management – some of which we
covered in the semiotic challenge.
• Lack of consensus about the meaning of key concepts (Data, Data Quality, Data
Quality measurement) – which is why I started with definitions.
• Lack of clear goals and deliverables for the data assessment process.
• Lack of a methodology for defining "requirements," "expectations," and other criteria for the quality of data at the level of detail needed for measurement.
Data Quality as a Political Challenge
The political challenge: Data is knowledge, knowledge is power, power is political
Etymology of Politics:
• Poli = many
• Ticks = blood sucking vermin
Most people dislike politics.
People do not always mean to be political about data.
But data represents business processes, so people are protective.
Their data may be high quality, it may be low quality, or they may not know.
Data is about knowledge and people like to be knowledgeable. No one likes to feel "un-knowledgeable" (i.e., dumb).
Overcoming the Knowledge
Challenge
Through Profiling and Data Inspection
Definition: Data Quality & Data Quality Management
Data Quality: A measure of the degree to which data is fit for the purposes of the people and systems that use the data.
Data Quality Management: A set of activities intended to ensure that data is fit for purpose, including:
• Data quality assessment
• Data quality requirements definition
• Data quality monitoring
• Data issue detection
• Issue remediation
• Reporting on data quality
• Improving business and technical processes to ensure data is of high quality
All data management processes have the potential to impact the fitness of data for use. But not every data management
process needs to be called a “data quality” process.
(Diagram notes: data quality assessment is the activity of running the engine and looking at the results, i.e., data profiling and analysis. Analysis of profiling results supports the other activities.)
Definition: Data Profiling
• Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing.
• Data quality assessment is the process of evaluating data to identify errors and understand their
implications (Maydanchik, 2007).
• Data profiling is a specific kind of data analysis used to discover and characterize important
features of data fields and data sets, including:
– Data types
– Field lengths
– Cardinality of columns
– Granularity
– Existing values
– Format patterns
– Content patterns
– Implied rules
– Cross-column and cross-file data relationships
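A minimal profiling sketch illustrates several of the characteristics listed above (population, cardinality, format patterns). The rows and column names here are hypothetical, and real profiling tools compute far more:

```python
from collections import Counter
import re

# Hypothetical sample data; in practice this would come from a source file or table
rows = [
    {"acct_id": "A001", "ytd_actual": "0.01"},
    {"acct_id": "A002", "ytd_actual": "-0.39"},
    {"acct_id": "A002", "ytd_actual": ""},
]

def profile_column(rows, col):
    values = [r[col] for r in rows]
    populated = [v for v in values if v != ""]
    # Format pattern: digits -> 9, letters -> A (a common profiling convention)
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in populated
    )
    return {
        "count": len(values),          # total records
        "populated": len(populated),   # non-blank records
        "cardinality": len(set(populated)),  # distinct values
        "patterns": dict(patterns),    # format patterns and their frequencies
    }

print(profile_column(rows, "acct_id"))
# {'count': 3, 'populated': 3, 'cardinality': 2, 'patterns': {'A999': 3}}
```

Even this toy version surfaces the questions profiling is meant to raise: why is one field sparsely populated, and does the observed pattern match the documented format?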
Data Profiling Inputs and Outputs
Profiling Goals – Overcoming the Knowledge Challenge in Projects
• Reduce risks related to data development
– Enable initial assessment of source-supplied metadata to reduce the risk of errors related to incorrect identification of data fields
– Identify risks and obstacles to use of sources (data issues, incorrect assumptions, differences in data granularity, naming conventions, etc.)
– Accurately identify encryption requirements
– Identify critical data for ongoing data quality measurement, monitoring, and reporting
• Improve project process efficiency
– Improve the quality and consistency of system metadata, beginning with table and column definitions
– Provide input to mapping, including conformance
– Provide input to data modeling
– Provide input to ETL design, including system controls
– Provide input for Quality Assurance and User Acceptance Testing
• Enable governance over time
– Data quality monitoring
– Improved metadata
DART – Data Analysis Results Template
The DART’s worksheets break down into four groups:
• Reference information: Five tabs describe the template and provide guidance on how to observe data characteristics
in profiling results
• Project information: Two tabs bookend the process. One for project goals, the other for summarized findings and
action items.
• Findings and analysis: Four tabs that make up the core of the template and allow analysts to consistently document
what they see. (Note: At this time, the DQ Analysis tab will not be used by projects)
• DQ specification: Captures details for DQ measurements. Input for this will come from the findings and analysis tabs
REFERENCE TABS: Template Purpose and Overview; Guidelines and Usage Notes; DQ Check List; Field Definitions; Downloads
PROJECT ADMIN TABS: Project Details; Sign Offs and Action Items
ANALYSIS AND FINDINGS TABS: Context and Metadata Results; Table Level Results; Column Level (for project work)
SPECIFICATION TAB: DQ Measurement Specification
The DART Process and Results
Example Overall Findings
FINDING CATEGORY | COUNT | PERCENTAGE
No Data -- 100% defaulted | 85 | 35%
Data appears as expected | 64 | 26%
Technical field | 21 | 9%
Data Differs from Metadata | 19 | 8%
Questionable Values | 17 | 7%
Should be encrypted, is not | 10 | 4%
Sparse Data -- 99% Defaulted | 9 | 4%
Questionable Population | 7 | 3%
Conformance Risk | 6 | 2%
Questionable Population and Values | 4 | 2%
TOTAL | 242 | 100%
Making findings actionable
FINDING CATEGORY | COUNT | PERCENTAGE | ACTION
Data appears as expected | 64 | 26% | No action
Technical field | 21 | 9% | No action
No Data -- 100% defaulted | 85 | 35% | Determine impact to project
Data Differs from Metadata | 19 | 8% | Clarify with source system; update metadata
Questionable Values | 17 | 7% | Clarify with source system; update metadata
Sparse Data -- 99% Defaulted | 9 | 4% | Clarify with source system; update metadata
Questionable Population | 7 | 3% | Clarify with source system; update metadata
Questionable Population and Values | 4 | 2% | Clarify with source system; update metadata
Conformance Risk | 6 | 2% | Inform BAs and Modelers
Should be encrypted, is not | 10 | 4% | Revise encryption requirements for file
TOTAL | 242 | 100% |
[Bar chart of finding counts by category, grouped by action: No action required; Determine if there is impact; Clarify with Source System; Manage within the workstream (inform BAs and Modelers; update requirements)]
Big Data Challenges
Because it is not getting simpler…
Some people are optimistic about Big Data
Often, big data is messy,
varies in quality .... What we
lose in accuracy at the micro
level we gain in insight at the
macro level.
Viktor Mayer-Schonberger and Kenneth Cukier, Big
Data: A revolution that will transform how we live,
work, and think.
A data lake's data quality practices are
less about the syntactic quality of the
data (are all the fields perfect?) and more
about the semantic quality of the data
(can we use this well?).
John Myers, "How to answer the top three objections to a data lake." InfoWorld, September 6, 2016
People who object to data lakes are only
defending the care, feeding, and maintenance of
a data warehouse. The types of 'needs' that this
objection is attempting to address are data
governance, quality, stewardship, and lineage.
John Myers, "How to answer the top three objections to a data lake." InfoWorld, September 6, 2016
And other people are not
We see customers creating big
data graveyards, dumping
everything into HDFS and
hoping to do something with it
down the road. But they just
lose track of what's there.
Sean Martin, Cambridge Semantics
Many companies are guilty of dumping
data into the data lake without a strategy
for keeping track of what's being
ingested. This leads to a murky, swampy
repository .... Unlike relational databases,
Hadoop is little help when it comes to
quality control.
Tony Fisher, Validating Data in the Data Lake. Zaloni Blog. December 15,
2016.
Without at least some
semblance of information
governance, the lake will end
up being a collection of
disconnected data pools or
information silos all in one
place.
Gartner, "Gartner says Beware the Data Lake Fallacy"
Some data lake initiatives have not
succeeded, producing instead more
silos or empty sandboxes.
Brian Stein and Alan Morrison, "The enterprise data lake: Better
integration and
deeper analytics." PWC Technology Forecast: Rethinking
Integration. Issue 1, 2014.
Big Data – Definition
The DMBOK2 points out:
• The term Big Data is associated with technological changes that have enabled people “to
generate, store, and analyze larger and larger amounts of data.”
• People use this data "to predict and influence behavior, as well as gain insight on a range of important subjects, such as health care practices, natural resource management, and economic development." And Shopping.
Big Data goes hand-in-hand with data science:
• Changes in technology not only enable collection of huge amounts of data, they also enable
analysis of it.
• Data Science includes the creation of models that enable understanding of possible outcomes, if
variables change.
Data as the Product of a Process
DQ problems are usually detected in data output, but those problems can be caused at any point in the production or consumption process.
Data Quality is understood in terms of outputs:
• Expected outputs = Good Quality Data
• Unexpected outputs = Poor Quality Data
Big Data Production Process has Different Risks
Still a relationship between inputs, steps, and outputs, but more risk in the process.
Risk can be reduced through knowledge of the original production processes for the data.
Life Cycle Management
The same questions
apply to Big Data as
apply to traditional
data.
The same
connections exist
between data quality,
metadata and data
governance.
Big Data and the Product Metaphor
Big Data
• Volume
• Variety
• Velocity
• Veracity
Creating more data, of different kinds,
more quickly.
Different types of data have different
degrees of structure depending on
how they are produced.
• Production: Data comes from many places;
very little control over the inputs
• Inventory: Organizations do not know what
data they have, what condition it is in, what
relation it has to the processes that created
it, etc.
• Storage: The ways that we store data have
an impact on its quality
• Usage: Organizations do not know how data will be used, and they do not recognize the connection between data production and data uses.
• For Big Data, these problems are intensified.
Volume & Velocity: Impact on Veracity
The Big Data characteristics of volume and velocity affect how veracity (truth, and from there, quality) can even be defined.
Type of Data | Volume | Velocity | Veracity
Mainframe | Large but predictable | Fast but predictable | Measurable
Tabular | Large but predictable | Fast but predictable | Measurable
Machine Generated | Potentially huge | Super fast | Depends on calibration
Unstructured | Potentially huge | As fast as people can produce it | What would this even mean?
Variety
We associate Big Data with new kinds of data, but a lot of traditional data is also being stored in
data lakes.
Big Data is often referred to as "unstructured," but it includes a lot of semi-structured data and also forms of data that are inherently structured by virtue of how they are collected.
Type of Data | Example | Inherent Structure
Trad Mainframe | EBCDIC files | High, but messy
Trad Tabular | Warehouse tables | High
Big Machine Generated | Sensor data | Very high
Big Unstructured | Twitter | Low
Logical Relationship between DQ Dimensions
• COMPLETENESS: You have all the pieces of data you need or expect to have.
If you do not have all the data you need, then other measurements of quality may not even matter.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect.
• VALIDITY: Data values belong to the set of possible (expected) values.
If the data values are not in the allowed set of values, then they cannot be correct.
• INTEGRITY: Different pieces of data relate to each other in the ways you expect them to.
If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way
you intended to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
If the data does not follow expected patterns, then you will want to understand why (Change in the pattern or error
in the data?)
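The hierarchy above can be illustrated with a small sketch in which each check is attempted only after the previous one passes. The field names, format rule, and valid-value set are hypothetical, chosen only to show the ordering:

```python
import re

VALID_STATES = {"NY", "MA", "TX"}  # assumed reference data for illustration

def assess(record):
    findings = []
    # COMPLETENESS: all expected fields present and populated
    for field in ("member_id", "state"):
        if not record.get(field):
            findings.append(f"incomplete: {field}")
    if findings:
        return findings  # if data is missing, later checks may not even matter
    # FORMAT CONFORMITY: state should be two uppercase letters
    if not re.fullmatch(r"[A-Z]{2}", record["state"]):
        return ["nonconforming format: state"]
    # VALIDITY: value must belong to the expected set
    if record["state"] not in VALID_STATES:
        return ["invalid value: state"]
    return []  # passed completeness, conformity, and validity

print(assess({"member_id": "M1", "state": "NY"}))  # []
print(assess({"member_id": "M1", "state": "ZZ"}))  # ['invalid value: state']
print(assess({"member_id": "", "state": "NY"}))    # ['incomplete: member_id']
```

Integrity and consistency checks would follow the same pattern but require multiple records or data sets, so they are omitted from this single-record sketch.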
Big Data and Dimensions of Quality
Dimensions of quality provide a means to think about how to approach quality for big data.
Type of Data | Completeness | Format Conformity | Validity | Integrity | Consistency
Mainframe | Number of records generated / time period | Constrained by rules | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Tabular | Number of records generated / time period | Constrained by rules | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Machine Generated | Rate at which data is collected | Constrained by collection device | Based on calibration of collection device | Depends on consistent collection devices | Expectation based on the process the data represents
Unstructured | ?? | Not relevant | Not applicable | ?? | No expectation of consistency
Data Quality Challenges – Intensified by Big Data
• The semiotic challenge: People have different ways of representing the “same” concepts
– Traditional: GOVERNANCE
– Big: Governance, but at the category and metadata level
• The technical challenge: Different technical approaches to creating and using data influence the data itself.
– Traditional: DATA STANDARDS
– Big: Manage the ingest process, esp. manage metadata up front
• The knowledge challenge: Because data is complicated, a single individual cannot know all the data.
– Traditional: METADATA
– Big: Metadata is even more important
• The political challenge: Data is knowledge, knowledge is power, power is political
– Traditional: GOVERNANCE / CULTURE
– Big: Governance/Culture
Meeting Big Data Challenges
Let’s do this thing!
Meeting the Challenges for Big Data
METADATA – Addressing the knowledge challenge
• Production and Lineage: Data comes from many places;
very little control over the inputs. Need to know where data
comes from
• Inventory: Inventorying how much data an organization has
/ what data it has
• Storage: Need to understand how ingest and storage
process impacts data
• Usage: We will never know all the potential uses of data.
Ensure consumers know what the data represents, how it
was produced, how it is stored
• Metadata management: Enabling data usage by managing
knowledge of data; set minimum requirements for metadata
related to big data. The priorities change. It is not possible to
define every field in the way you would with traditional data
GOVERNANCE – Addressing data, process,
and cultural risks
• Accountability: Defining data ownership and
accountability
• Protection: Protecting against the misuse of
data
• Risk mitigation: Managing risks associated
with data
• Standards: Defining and enforcing standards
for data quality
Summary
Product Management practices do work for traditional
data.
They also work for Big Data, but with modifications
based on the production processes of Big Data.
Managing the quality of both Big Data and traditional
(little) data is dependent on managing metadata.
The process of figuring out how to manage your data
will significantly inform what you need to do to govern
your data via standards and monitoring.
Meeting the Challenges with Big and Little Data:
Characteristics of a Trusted Source of Data
1. SECURE: Data is protected against inappropriate access or use through policies, processes, and tools.
2. RELIABLE: Data processing is predictable and reliable. The system is monitored for performance. Controls are in
place to detect and respond to unexpected events.
3. DATA QUALITY IS KNOWN: The criteria for high quality data are defined. Levels of quality are measured and
reported on. Data issues are communicated to data consumers and remediated based on business priorities.
4. TRANSPARENT AND COMPREHENSIBLE: Data consumers have the information (Metadata) they need to
understand and get value from the data. Knowledge about the system and its data is documented, accessible,
usable, and current.
5. SUPPORTED: A dedicated production support team is in place and has the processes and protocols it needs to
respond in a timely manner to questions and issues related to the operation of the system and the data in the
system.
6. COMMUNICATED: New data consumers have access to relevant training; existing data consumers are informed of
changes that impact their uses of the data.
7. GOVERNED: Processes and accountabilities are in place to make decisions about the data in the system.
Goals from Agenda
• Introductions
• Quality management concepts and principles
• Applying quality management to traditional data
• Big Data challenges
• Data Quality Practices for Big Data and Little Data
Thank you!
Laura Sebastian-Coleman
Sebastian-ColemanL@Aetna.com
References
DAMA International. DAMA-DMBOK: Data Management Body of Knowledge. 2nd edition. Technics Publications, 2017.
The Data Governance Institute. http://www.datagovernance.com/adg_data_governance_definition/
English, Larry. Improving Data Warehouse and Business Information Quality. John Wiley & Sons, 1999.
Fisher, Tony. "Validating Data in the Data Lake." Zaloni Blog, December 15, 2016.
Gartner. "Gartner Says Beware of the Data Lake Fallacy." https://www.gartner.com/smarterwithgartner/how-to-create-a-business-case-for-data-quality-improvement/
Klein, Anja, Hong-Hai Do, Marcel Karnstedt, Wolfgang Lehner, and Gregor Hackenbroich. "Representing Data Quality for Streaming and Static Data." https://www.researchgate.net/publication/4297383_Representing_Data_Quality_for_Streaming_and_Static_Data
Kumar, Vinu. "Solving Data Quality in Streaming Data Flows." https://streamsets.com/blog/solving-data-quality-streaming-data-flows/
Maydanchik, Arkady. Data Quality Assessment. Technics Publications, 2007.
Mayer-Schonberger, Viktor, and Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think.
McGilvray, Danette. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Morgan Kaufmann, 2008.
Myers, John. "How to answer the top three objections to a data lake." InfoWorld, September 6, 2016.
Redman, Thomas. "Bad Data Costs the U.S. $3 Trillion Per Year." Harvard Business Review. https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
Sebastian-Coleman, Laura. Measuring Data Quality for Ongoing Improvement. Morgan Kaufmann, 2013. Appendices: https://booksite.elsevier.com/9780123970336/downloads/Sebastian-Coleman_Appendix%20E.pdf
Soares, Sunil. "Big Data Governance over Streaming Data." https://www.dataversity.net/big-data-governance-over-streaming-data/#
TechTarget. "Unstructured data." https://searchbusinessanalytics.techtarget.com/definition/unstructured-data
Video Coin. "The 5 Most Important Metrics To Measure The Performance of Video Streaming." https://medium.com/videocoin/the-5-most-important-metrics-to-measure-the-performance-of-video-streaming-ab41f4eb9d99
Warner, James. "Innovative, unheard of use cases of streaming analytics." https://internetofthingsagenda.techtarget.com/blog/IoT-Agenda/Innovative-unheard-of-use-cases-of-streaming-analytics
Big Data Appendix
Data Quality Assessment and Monitoring Overview
Source: Sebastian-Coleman. Measuring Data Quality for Ongoing
Improvement. Morgan Kaufmann, 2013.
Approaches in Traditional and Big Data Environments
TRADITIONAL INTEGRATED DATA
WAREHOUSE
Semiotic: Standardize data and definitions, via a
data model. Everyone will love it.
Knowledge: Rely on business SMEs for input.
They know everything. Information Architects will
capture knowledge in the data model.
Technical: Adopt a single technology to execute
ETL and integrate data. All data comes through
one route and is standardized via that route. If
possible, adopt a single BI tool.
Political: Reassure everyone that the data they
are used to will be “the same”
BIG DATA ENVIRONMENT (Lake, Fabric)
Semiotic: Assume data from different sources will
fit together. If it doesn’t, people will figure it out,
they are data scientists after all.
Knowledge: Ask for a data dictionary but don’t
worry if you don’t get it. Assume that the people
requesting the data know what the data
represents so other people probably will, too
Technical: Allow multiple technologies, for
integration and analysis. Hope that people get the
same answers from the same data, even though
the tools work in totally different ways.
Political: Reassure everyone that the data is
correct because “it is what the source provided”
Define and Measure: Traditional Data
Define
• The data itself: know / document what the
data represents
• Expectations / quality characteristics for
the data
• Standards
• Rules
Measure: Actual data against expectations
Completeness Rule: Column is mandatory. It
must be populated.
Validity: Valid values include X,Y, Z. All other
values are invalid.
# of records populated with a valid value /
Total # of records =
Percentage of records that meet quality rule
This is a very simple example, but the idea can
be extended from columns, to files, to data
domains.
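The measurement described on this slide reduces to a short calculation. The record values and valid-value set here are illustrative, matching the slide's X, Y, Z example:

```python
# Column values for a mandatory field; "" represents an unpopulated record
records = ["X", "Y", "", "Q", "Z", "X"]
valid_values = {"X", "Y", "Z"}  # the slide's example valid-value set

# Records populated with a valid value / total records
populated_valid = sum(1 for v in records if v in valid_values)
pct = populated_valid / len(records)

print(f"{pct:.0%} of records meet the quality rule")  # 67% of records meet the quality rule
```

Extending this from one column to files or data domains is mostly a matter of aggregating the same ratio across many such rules.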
Why challenges are intensified with Big Data: The V’s
Variety: Measurement of quality depends
on data’s inherent structure.
Volume and Velocity affect how veracity
(truth, and from there, quality) can be
defined.
Type of Data | Volume | Velocity | Veracity
Trad Mainframe | Large but predictable | Potentially fast but predictable | Measurable (compare to real world or other data)
Trad Relational DB | Large but predictable | Potentially fast but predictable | Measurable (compare to real world or other data)
Big Machine Generated | Potentially huge | Super fast, variable | Dependent on calibration of instrument / collection device
Big Unstructured | Potentially huge | Super fast, variable | What would this even mean?
VARIETY
Type of Data | Example | Inherent Structure | Structure driven by
Trad Mainframe | EBCDIC files | High, but messy | Design of the originating system
Trad Relational DB | Warehouse tables | High | Data Model
Big Machine Generated | Streaming sensor data | Very high | Design of the collection device
Big Unstructured | Twitter | Low | Application interface, language of user
Why challenges are intensified with Big Data
Dimensions of quality provide a way to think about data quality measurement for Big Data
Type of Data | Completeness | Format Conformity | Validity | Integrity | Consistency
Trad Mainframe | Number of records generated / time period; comparison to a known real-world population | Constrained by system rules | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Trad Relational DB | Number of records generated / time period; comparison to a known real-world population | Constrained by model; can be systematically constrained | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Big Machine Generated | Rate at which data is collected | Constrained by collection device | Based on calibration of collection device | Depends on consistent collection devices | Expectation based on the process the data represents
Big Unstructured | No general definition of "completeness" | Not applicable | Not applicable | Not applicable | Not applicable
Machine generated – Quality depends on the machines that collect the data.
Unstructured – Quality depends on having adequate metadata describing individual data sets.
Approaches to DQ Measurement for Streaming Data
Streaming Video
• Data Quality characteristics =
product characteristics: Color,
speed, image resolution, sound /
image synchronization
• Risks:
• The initial data could be
corrupted
• Delivery system does not
deliver as expected
• Because the biggest risk is
interference with the delivery
system, Quality = Signal-to-Noise
ratio
Streaming Sensor Data
• Data Quality characteristics are
defined by calibrating the collection
device.
• Risks
• Devices calibrated
inconsistently
• Interference with a device
• Alignment between data from
related sensor streams
• Quality assessed through
• Metadata related to the
conditions of data collection.
• Monitoring temporal aspect of
delivery
• Patterns in data content
“Traditional” Data Content Streamed
• Data quality characteristics are similar to those in traditional data (e.g., field = mandatory or optional; criteria are defined for validity; the "same" field has the "same" content)
• Risks:
• Data collected incorrectly
• Data lost in process of delivery
• Measurement done instream: compare incoming data to existing data (e.g., reference or master data; content of existing 'records')
• Quality = the level of exceptions; the challenge is establishing the denominator for any measurement (likely a timeframe, rather than a population of records)
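The "timeframe as denominator" idea can be sketched as a per-minute exception rate over a stream of (timestamp, value) records. The field names, valid-code set, and minute-level bucketing are all assumptions for illustration:

```python
from collections import defaultdict

VALID_CODES = {"X", "Y", "Z"}  # assumed reference data

def exceptions_per_minute(stream):
    """stream yields (epoch_seconds, code) tuples; returns exception rate per minute."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, code in stream:
        minute = ts // 60          # bucket by timeframe, not by record population
        totals[minute] += 1
        if code not in VALID_CODES:
            errors[minute] += 1
    return {m: errors[m] / totals[m] for m in totals}

# Two minutes of illustrative traffic: "Q" and "#" are exceptions
stream = [(0, "X"), (10, "Q"), (61, "Y"), (70, "Y"), (75, "#")]
print(exceptions_per_minute(stream))  # {0: 0.5, 1: 0.3333333333333333}
```

Bucketing by time rather than by a fixed record population is what makes the measurement workable when the stream has no natural "total number of records."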
 

Similar to Dw19 t1+ +dq+fundamentals-cvs+template

Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesDATAVERSITY
 
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...Health Catalyst
 
The New Age Data Quality
The New Age Data QualityThe New Age Data Quality
The New Age Data QualityRanjeet202050
 
Building Rules for Data Governance
Building Rules for Data GovernanceBuilding Rules for Data Governance
Building Rules for Data GovernancePrecisely
 
chapter12-220725121546-610a1427.pdf
chapter12-220725121546-610a1427.pdfchapter12-220725121546-610a1427.pdf
chapter12-220725121546-610a1427.pdfMahmoudSOLIMAN380726
 
‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management
‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management
‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality ManagementAhmed Alorage
 
A Business-first Approach to Building Data Governance Program
A Business-first Approach to Building Data Governance ProgramA Business-first Approach to Building Data Governance Program
A Business-first Approach to Building Data Governance ProgramPrecisely
 
7 principles of data quality management
7 principles of data quality management7 principles of data quality management
7 principles of data quality managementMileyJames
 
Data Integrity: From speed dating to lifelong partnership
Data Integrity: From speed dating to lifelong partnershipData Integrity: From speed dating to lifelong partnership
Data Integrity: From speed dating to lifelong partnershipPrecisely
 
From Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data GovernanceFrom Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data GovernancePrecisely
 
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipelineQlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipelineSrikanth Sharma Boddupalli
 
Data-Ed Webinar: Data Quality Engineering
Data-Ed Webinar: Data Quality EngineeringData-Ed Webinar: Data Quality Engineering
Data-Ed Webinar: Data Quality EngineeringDATAVERSITY
 
Data architecture around risk management
Data architecture around risk managementData architecture around risk management
Data architecture around risk managementSuvradeep Rudra
 
Stop the madness - Never doubt the quality of BI again using Data Governance
Stop the madness - Never doubt the quality of BI again using Data GovernanceStop the madness - Never doubt the quality of BI again using Data Governance
Stop the madness - Never doubt the quality of BI again using Data GovernanceMary Levins, PMP
 
Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...
Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...
Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...Patrick Van Renterghem
 
CDMP SLIDE TRAINER .pptx
CDMP SLIDE TRAINER .pptxCDMP SLIDE TRAINER .pptx
CDMP SLIDE TRAINER .pptxssuser65981b
 
Fuel your Data-Driven Ambitions with Data Governance
Fuel your Data-Driven Ambitions with Data GovernanceFuel your Data-Driven Ambitions with Data Governance
Fuel your Data-Driven Ambitions with Data GovernancePedro Martins
 
Cff data governance best practices
Cff data governance best practicesCff data governance best practices
Cff data governance best practicesBeth Fitzpatrick
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management BasicKhaled Mosharraf
 
Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315Ashley Ohmann
 

Similar to Dw19 t1+ +dq+fundamentals-cvs+template (20)

Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success Stories
 
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
 
The New Age Data Quality
The New Age Data QualityThe New Age Data Quality
The New Age Data Quality
 
Building Rules for Data Governance
Building Rules for Data GovernanceBuilding Rules for Data Governance
Building Rules for Data Governance
 
chapter12-220725121546-610a1427.pdf
chapter12-220725121546-610a1427.pdfchapter12-220725121546-610a1427.pdf
chapter12-220725121546-610a1427.pdf
 
‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management
‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management
‏‏‏‏‏‏‏‏‏‏Chapter 12: Data Quality Management
 
A Business-first Approach to Building Data Governance Program
A Business-first Approach to Building Data Governance ProgramA Business-first Approach to Building Data Governance Program
A Business-first Approach to Building Data Governance Program
 
7 principles of data quality management
7 principles of data quality management7 principles of data quality management
7 principles of data quality management
 
Data Integrity: From speed dating to lifelong partnership
Data Integrity: From speed dating to lifelong partnershipData Integrity: From speed dating to lifelong partnership
Data Integrity: From speed dating to lifelong partnership
 
From Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data GovernanceFrom Compliance to Customer 360: Winning with Data Quality & Data Governance
From Compliance to Customer 360: Winning with Data Quality & Data Governance
 
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipelineQlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
 
Data-Ed Webinar: Data Quality Engineering
Data-Ed Webinar: Data Quality EngineeringData-Ed Webinar: Data Quality Engineering
Data-Ed Webinar: Data Quality Engineering
 
Data architecture around risk management
Data architecture around risk managementData architecture around risk management
Data architecture around risk management
 
Stop the madness - Never doubt the quality of BI again using Data Governance
Stop the madness - Never doubt the quality of BI again using Data GovernanceStop the madness - Never doubt the quality of BI again using Data Governance
Stop the madness - Never doubt the quality of BI again using Data Governance
 
Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...
Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...
Presentation by Cédric Charlier (Elia) at the Data Vault Modelling and Data G...
 
CDMP SLIDE TRAINER .pptx
CDMP SLIDE TRAINER .pptxCDMP SLIDE TRAINER .pptx
CDMP SLIDE TRAINER .pptx
 
Fuel your Data-Driven Ambitions with Data Governance
Fuel your Data-Driven Ambitions with Data GovernanceFuel your Data-Driven Ambitions with Data Governance
Fuel your Data-Driven Ambitions with Data Governance
 
Cff data governance best practices
Cff data governance best practicesCff data governance best practices
Cff data governance best practices
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management Basic
 
Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315
 

Dw19 t1+ +dq+fundamentals-cvs+template

  • 1. Proprietary ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary 1 #damaweek2019 Data Quality Fundamentals Laura Sebastian-Coleman, Ph.D., CDMP November 2019
  • 2. Data Quality Fundamentals Laura Sebastian-Coleman, Ph.D., CDMP Data Quality Lead Shared Services Enterprise Data Governance, CVS Health DAMA Days – DAMA Mexico November 2019 ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary
  • 3. Proprietary Abstract: DQ Fundamentals • Organizations today get value from their data in the face of challenging odds. • Optimal management of traditional data requires a wide skillset and a strategic perspective. • Changes in technology have increased the volume, velocity, and variety of data, but many organizations do not yet have a handle on veracity in traditional data management environments, never mind big data environments. • And, while big data is on the rise, more traditional forms of data are not going away. Instead, different kinds of data will co-exist and must be managed in conjunction with one another. • This tutorial will revisit the fundamentals of data quality management in the light of big data and explore how to apply them in traditional and big data environments. • Participants will learn how to assess the current state of their data environment and deliver more reliable data to their stakeholders.
  • 4. Proprietary About me Data quality practitioner in the health care industry since 2003 Background in banking, manufacturing / distribution, commercial insurance, and academia Publications – Author, Navigating the Labyrinth: An Executive Guide to Data Management (2019) – Production Editor, DAMA Data Management Body Of Knowledge second edition, [DMBOK2] (2017) – Author, Measuring Data Quality for Ongoing Improvement (2013) Service – Advisor, DAMA New England, 2019 - present – DAMA Publications Officer, 2015 – 2019 – IAIDQ (now IQ International) Member Director, 2010-12 Recognition – DAMA International Recognition for Outstanding Contributions to Data Management, 2019 – DAMA New England Award for Excellence in Data Management, 2019 – IAIDQ (now IQ International) Distinguished Member Award, 2015
  • 6. Proprietary Abstract and Agenda Abstract: Organizations today get value from their data in the face of challenging odds. Optimal management of traditional data requires a wide skillset and strategic perspective. Changes in technology have increased the volume, velocity, and variety of data, but many organizations do not yet have a handle on veracity in traditional data management environments, never mind big data environments. And, while big data is on the rise, more traditional forms of data are not going away. Instead, different kinds of data will co-exist and must be managed in conjunction with one another. This tutorial will revisit the fundamentals of data quality management in the light of big data and explore how to apply them in traditional and big data environments. Participants will learn how to assess the current state of their data environment and deliver more reliable data to their stakeholders. Agenda • Introductions • Quality management concepts and principles • Applying quality management to traditional data • The role of measurement and monitoring • Big Data challenges • Data Quality Practices for Big Data and Little Data
  • 7. Proprietary Why Data Quality matters: Because data is valuable
  • 8. Proprietary Why DQ Management Matters: Poor quality data costs money • Reports differ, but many estimate that 10-30% of productivity is lost due to poor quality data. • This may even be low, since one report indicated that data scientists spend 60% of their time cleansing data. • IBM estimated that data quality problems cost the US $3 trillion in 2016. [Charts: Productive Time 70% vs. Unproductive Time 30%; Data Scientists' time: 60% cleansing vs. 40% analyzing]
  • 9. Quality Management Concepts A short history of an important idea ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 9
  • 10. Proprietary Definition of Quality: Fitness for Purpose / Fitness for Use Data Quality: A measure of the degree to which data is fit for the purposes of the people, processes, and systems that use the data. The concept of “fit for purpose” directly relates data quality to the quality of manufactured products. Data = a Product. Data is NOT a by-product. “Fit for Purpose” also relates data quality to the concept of a data consumer – a person, process, or system that uses data. Data Quality Management: A set of activities intended to ensure that data is fit for the purposes of its data consumers.
  • 11. Proprietary Manufacturing: A brief history of mass-produced products 19th Century Industrial Revolution: • Steam power • Interchangeable parts • Development of large factories • Production line manufacturing processes • Machine tooling 20th Century Mass Production: • Machine tooled interchangeable parts • Assembly line • Vertical integration of the manufacturing process • Quality control
  • 12. Proprietary Power, Process, Technology, and Standardization enabled Vertical Integration of Manufacturing
  • 13. Proprietary Pioneers of Quality Control • Defined criteria for quality based on customer expectations • Recognized the relation between a well-defined process and a predictable outcome • Used measurement to manage and improve processes • Created tools to assess and improve product quality • Recognized that producing a quality product requires life cycle management, supply chain management, and leadership commitment
  • 14. Proprietary Quality Control in Manufacturing: Product and Process A process is a series of steps that turn inputs into outputs. • The better quality the inputs • The better defined the steps • The better quality the outputs Add to this the idea that the execution of processes can be improved through observation, analysis, and feedback. The more consistent the input and the more consistently the process is executed, the more consistent the result.
  • 15. Proprietary Quality and the Customer Thought leaders in Quality Control / Quality improvement recognize that there is a customer at the end of the assembly line: Someone wants to buy the product. That person has expectations at two levels: • At the very least, the Product must perform its primary function. • Ideally, the Product also pleases the customer in some way. Dimensions of Product Quality (from David Garvin) – Performance: The product operates as expected. – Features: The product has additional characteristics that please the customer. – Reliability: The product works well. The customer can count on it. – Conformance: The product meets standards. – Durability: The product lasts for an expected amount of time. – Serviceability: If the product breaks it can be fixed. – Aesthetics: The product is attractive and pleasing. – Perceived Quality: The customer feels good about the product.
  • 16. Proprietary Intention and quality: Quality is not accidental Source: Kaizen institute of India. https://kaizeninstituteindia.wordpress.com/2013/10/08/quality-is- not-an-act-it-is-a-habit/
  • 17. Proprietary Life Cycle Management Life Cycle management is an extension of the idea of quality control to all aspects of creating a product.
  • 18. Proprietary The Role of Measurement in Quality Control Statistical process control – a means to measure the consistency of processes Measurement formalizes expectations Monitoring ensures unexpected variation within the system is detected
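The statistical process control idea on this slide can be sketched in a few lines of Python: compute control limits from the historical variation of a monitored data quality metric, then flag new measurements that fall outside them. This is a minimal illustration only; the metric (daily percent of invalid records) and the sample values are hypothetical, not from the deck.

```python
# Sketch: statistical process control for a monitored DQ metric.
# The metric ("percent invalid per day") and values are illustrative.
from statistics import mean, stdev

def control_limits(history, sigmas=3):
    """Derive lower/upper control limits from historical measurements."""
    m, s = mean(history), stdev(history)
    return m - sigmas * s, m + sigmas * s

def check_measurement(value, history):
    """Return True if a new measurement is within expected variation."""
    lower, upper = control_limits(history)
    return lower <= value <= upper

# Ten recent "percent invalid" observations, then two new values:
history = [1.9, 2.1, 2.0, 1.8, 2.2, 2.0, 1.9, 2.1, 2.0, 2.1]
print(check_measurement(2.05, history))  # within limits -> True
print(check_measurement(9.50, history))  # unexpected spike -> False
```

Monitoring in this style formalizes expectations (the limits) and makes unexpected variation detectable, which is the point of the slide.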
  • 19. Applying Quality Management Concepts to Data Produce data like we produce other products ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 19
  • 20. Proprietary Definition: Data Quality Management Data Quality: A measure of the degree to which data is fit for the purposes of the people and systems that use the data. Data Quality Management: A set of activities intended to ensure that data is fit for purpose, including: • Data quality assessment • Data quality requirements definition • Data quality monitoring • Data issue detection • Issue remediation • Reporting on data quality • Improving business and technical processes to ensure data is of high quality What you mean by high quality data How you detect low quality data What you do about low quality data All data management processes have the potential to impact the fitness of data for use. Not every process needs to be called a “data quality” process. Core Data Quality processes have foundational, project-oriented, and operational components.
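Several of the core activities this slide lists (defining expectations, assessing data against them, detecting issues) can be tied together in a minimal rule-driven loop. A possible sketch, assuming hypothetical rule names and a made-up sample dataset:

```python
# Sketch: expectations as named rules; assessment collects failures
# as issues that can feed remediation and DQ reporting.
# Rule names, fields, and sample records are illustrative assumptions.

rules = {
    "member_id_populated": lambda rec: bool(rec.get("member_id")),
    "age_in_range": lambda rec: 0 <= rec.get("age", -1) <= 120,
}

def assess(records):
    """Run every rule against every record; collect failures as issues."""
    issues = []
    for i, rec in enumerate(records):
        for name, rule in rules.items():
            if not rule(rec):
                issues.append({"record": i, "rule": name})
    return issues

sample = [{"member_id": "M001", "age": 42}, {"member_id": "", "age": 130}]
for issue in assess(sample):
    print(issue)  # each issue identifies the failing record and rule
```

The separation matters: the rules express what you mean by high quality data, the assessment detects low quality data, and the issue list drives what you do about it.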
  • 21. Proprietary Stuff Connected with DQ Management that is Not Exactly DQ Management Quality Assurance • Kinda DQ: QA focuses on quality. • Not Quite DQ: Focus is on functionality and may or may not include data. Project process, rather than ongoing process. System Controls (manage data movement) • Kinda DQ: System Controls help confirm data completeness – they show that you have not lost data. • Not Quite DQ: You need them to run the system, regardless of the quality of data. Architecture / System Design to enforce quality • Kinda DQ: System design can directly impact the quality of data. • Not Quite DQ: System design encompasses many other things that are not data. Metadata Management • Kinda DQ: You cannot understand data without metadata. • Not Quite DQ: Metadata is a form of data. It requires the same kind of DQ management that other forms of data require. Data Stewardship • Kinda DQ: Stewards know a lot about data and much of what they do helps us understand data quality. • Not Quite DQ: Stewardship is wider than quality and may not even focus on quality. Data Cleansing • Kinda DQ: It makes the data better. Isn’t that the point? • Not Quite DQ: Data cleansing is a solution to some data quality issues. It is not a goal of DQ Management. ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary 21
  • 22. Proprietary Data as the Product of a Process Process: A process is a series of steps that turn inputs into outputs
  • 23. Proprietary Data as the Product of a Process DQ problems are usually detected in Data Output But those problems can be caused at any point in the production or consumption process Data Quality is understood in terms of outputs • Expected outputs = Good Quality Data • Unexpected outputs = Poor Quality Data
  • 24. Proprietary Complexity increases risks associated with data Risk multiplies as data moves along the data chain from system-to-system, use-to-use.
  • 25. Proprietary Intention: Data Quality Improvement via PDCA The same processes that are applied to improve the quality of manufactured products can be applied to improve the quality of data. Different improvement methodologies use essentially the same process. • Six Sigma • Lean • Total Quality Management
  • 26. Proprietary Simplified Improvement Cycle ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 26 Define Quality: Expectations, Standards, Rules Assess Data Against Expectations, Standards, Rules Define Measurement / Monitoring requirements Monitor Data Quality Report on Data Quality results Manage Data Issues Identify and act on Improvement opportunities Basis for Assessing Quality Revise / Improve Provide Input to Provide Input to Manage issues
  • 27. Proprietary ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 27 Define Quality: Expectations, Standards, Rules Assess Data Against Expectations, Standards, Rules Define Measurement / Monitoring requirements Monitor Data Quality Report on Data Quality results Manage Data Issues Identify and act on Improvement opportunities Basis for Assessing Quality Revise / Improve Provide Input to Provide Input to Manage issues Requires:  Standards for rules  Criteria for criticality  SME / Data Consumer input  Working set of CDEs  Feedback process  Maintenance process Requires:  Access to data  Profiling engine and process  Evaluation methodology  SME / Data Consumer input Requires:  Standards for rules  Analysis of historical data  Specification template  SME / Data Consumer input  Staff to implement and maintain Requires:  Guidelines and goals for monitoring  Process automation / tooling  Staff to review and respond  Response protocols  Access to system and business SMEs Requires:  Goals based on SME / Data Consumer input  Reporting standards / templates  Reporting tool  Schedule Requires:  Process flow  Issue definition template  Prioritization criteria  Escalation path  Tracking tool  SME / Data Consumer input  Access to decision makers Requires:  Knowledge of business goals  Knowledge of data issues  SME / Data Consumer input  Root cause analysis skills  Proposal process  Funding process
  • 29. Proprietary Data Life Cycle Management Adapted from Danette McGilvray, Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information
  • 30. Proprietary Manage Data Quality throughout the Data Life Cycle Managing quality throughout the data life cycle requires • Data Governance • Metadata Management Adapted from Danette McGilvray, Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information
  • 31. The Role of Measurement and Monitoring You cannot manage what you cannot measure ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 31
  • 32. Proprietary Dimensions of Data Quality – Why they matter • Data quality dimensions function in the way that length, width, and height function to express the size of a physical object. • They allow understanding of quality in relation to a scale and in relation to other data measured against the same scale. • Data quality dimensions can be used to define expectations (the standards against which to measure) for the quality of a desired dataset, as well as to measure the condition of an existing dataset. • Dimensions provide an understanding of why we measure (what question a measurement answers). For example, to understand the level of completeness, validity, and integrity of data. • Dimensions also help us identify things that we cannot measure or that there is little value in measuring.
  • 33. Proprietary Data Quality / Quality of Data Data Quality / Quality of Data: A measure of the degree to which data is fit for the purposes of the people and systems that use the data. What contributes to data’s “fitness for purpose”? • Representational Effectiveness: How well and consistently data represents the concepts it stands for. • Data Knowledge: How well data consumers understand and can de-code the data. • Dimensions of Quality: How well data conforms to expectations expressed via measurable Characteristics of Quality ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary 33
  • 34. Proprietary Dimensions of Data Quality A Dimension of Data Quality is a characteristic of data that can be measured and through which its quality can be quantified. There are many frameworks that define DQ dimensions, and there is no agreed-to set. However, all account for similar concepts, which have a common-sense meaning. • COMPLETENESS: You have all the pieces of data you need or expect to have. • FORMAT CONFORMITY: Data is in the form you expect it to be in. • VALIDITY: Data values belong to the set of possible (expected) values. • INTEGRITY: Different pieces of data relate to each other in the ways you expect them to. • CONSISTENCY: Data follows patterns that you expect it to follow.
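A few of these dimensions translate directly into simple checks on a field value. The sketch below is illustrative only: the field semantics (a ZIP code pattern, a small set of state codes) are assumptions for the example, not rules from this deck.

```python
import re

# Illustrative checks for three of the dimensions above, applied to a
# single field value. The pattern and code set are assumed for the example.

VALID_STATE_CODES = {"CA", "NY", "TX"}           # assumed value domain
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")    # assumed US ZIP format

def check_completeness(value):
    """COMPLETENESS: the field is populated (not None, not blank)."""
    return value is not None and str(value).strip() != ""

def check_conformity(value):
    """FORMAT CONFORMITY: the value matches the expected pattern."""
    return bool(ZIP_PATTERN.match(str(value)))

def check_validity(value):
    """VALIDITY: the value belongs to the defined domain of values."""
    return value in VALID_STATE_CODES
```

Integrity and consistency need more than one value to evaluate (cross-record or cross-time comparisons), which is why they tend to require more infrastructure than the field-level checks shown here.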
  • 35. Proprietary Data Quality Issue / Data Quality Improvement Data Quality Issue: A data quality issue is any condition of the data that is an obstacle to a data consumer’s use of the data, regardless of the root cause of the obstacle. • Issues can be caused by actual errors – a person or a process made a mistake. • Or by any of the challenges inherent in data • People misunderstand / misinterpret • Technology does unexpected things • People disagree Data Quality Improvement: A Data Quality improvement is a measurable, positive change in data that makes it more fit for use. In other words: any change that reduces or removes an obstacle. ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary 35
  • 36. Proprietary Logical Relationship between DQ Dimensions • COMPLETENESS: You have all the pieces of data you need or expect to have. If you do not have all the data you need, then other measurements of quality may not even matter. • FORMAT CONFORMITY: Data is in the form you expect it to be in. If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect. • VALIDITY: Data values belong to the set of possible (expected) values. If the data values are not in the allowed set of values, then they cannot be correct. • INTEGRITY: Different pieces of data relate to each other in the ways you expect them to. If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way you intended to. • CONSISTENCY: Data follows patterns that you expect it to follow. If the data does not follow expected patterns, then you will want to understand why (Change in the pattern or error in the data?)
  • 37. Proprietary Using DQ Dimensions to Create Standards & Rules Standard: A level of quality or attainment (a high standard for customer service); an idea or thing used as a measure, norm, or model in comparative evaluations (e.g., ISO standards) Rule: one of a set of explicit or understood regulations or principles governing conduct within a specific activity or sphere (e.g., Robert's Rules of Order); a principle that operates within a particular sphere of knowledge, describing or prescribing what is possible or allowable. Business Rule: A business rule is a rule that defines or constrains some aspect of business and always resolves to either true or false. Business rules are intended to assert business structure or to control or influence the behavior of the business. STANDARDS represent an intersection of Data Quality Management and Data Governance. Standards are also a form of metadata. DQ uses them to measure, but they are also simply useful to explain expectations.
  • 38. Proprietary Benefits of Rules • Common vocabulary – People understand expectations in a similar way • Consensus – People agree to the same things. • Differences – People can also disagree about rules. Rules provide a means of surfacing and therefore clarifying different expectations. • Simplicity – People make decisions once • Predictability – People know what is expected of them and they try to achieve it • And with data… they give us a way to talk more objectively about quality
  • 39. Proprietary Using DQ Dimensions to Create Standards & Rules Dimensions of quality are the foundation of a common vocabulary through which to articulate expectations for quality. They can be used to: • Create standards, rules & controls for models and applications • Establish measurements • Report problems consistently For example, completeness can be understood at several levels: system, data set, record, field.
Dimension | Data Object | Rule
Completeness | Data Set | The number of [distinct entity] must be equal to the number of [distinct entity] in [Source]
Completeness | Data Set | [Amount field] must reconcile to [Amount field] in [Source]
Completeness | Field | Must be populated
Completeness | Field | Must be populated; standard default value allowed
Completeness | Optional fields, with population rules | Must be populated when …
Completeness | Optional fields, with population rules | Must be populated except when … Must NOT be populated when …
  • 40. Proprietary Logic for Field Level Completeness Dimensions enable you to consistently describe the characteristics you are looking for. Many rules can be defined through a logical progression of questions related to the dimension. Here is an example focused on completeness at the field level. Similar questions could be asked at the system or file level.
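The decision logic described above can be sketched as code. The three rule types mirror the field-level completeness rows in the earlier table (mandatory, mandatory with default allowed, conditionally required); the record structure and the set of default values are assumptions for the example.

```python
# A sketch of field-level completeness logic. The rule names and the
# DEFAULT_VALUES set are assumptions for the example.

DEFAULT_VALUES = {"UNK", "N/A"}  # assumed standard default values

def check_field_completeness(value, rule, record=None, condition=None):
    """Evaluate one field value against a completeness rule.

    rule: 'mandatory'            -> must be populated; defaults not allowed
          'mandatory_default_ok' -> must be populated; default value allowed
          'conditional'          -> must be populated when condition(record) is True
    """
    populated = value is not None and str(value).strip() != ""
    if rule == "mandatory":
        return populated and value not in DEFAULT_VALUES
    if rule == "mandatory_default_ok":
        return populated
    if rule == "conditional":
        if condition is not None and condition(record):
            return populated
        return True  # field is not required for this record
    raise ValueError(f"unknown rule: {rule}")
```

The same progression of questions (is the field required? always, or only under conditions? are defaults acceptable?) could be asked at the file or system level, as the slide notes.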
  • 41. Proprietary Logic for Format Conformity As with completeness, with conformity, we can establish a logical progression of questions to define expectations at the field level. Some fields have one-and-only-one acceptable format. More complex fields may have a set of format requirements. Others are constrained only by data type and format may not indicate much about the quality of data.
  • 42. Proprietary Logic for Validity The word validity is used to refer in a general sense to whether or not the data is “good”. As a dimension of quality, it refers to whether values are part of a defined domain. Validity Rules can be based on how the domain of values is defined.
  • 43. Proprietary Sample Rule Syntax – Validity for Codified Data Working through the decision tree results in a standard syntax for expressing rules. Rules can be used to: • Clarify expectations about quality • Measure data quality • Report on data quality Depending on how much we know about the data, they can also be used to transform data. For example, to populate a consistent default value for all invalid values.
Dimension | Data Subset | Rule | Meaning
Validity | Codified data | Valid values are limited to: [List of valid values] | Specifies a list of values that are valid. All other values are invalid.
Validity | Codified data | Values must exist in [code table / column …] | Specifies the code table and the column in the code table in which valid values are stored. All other values are invalid.
Validity | Codified data | The range of valid values is between: [MIN] and [MAX] | Provides the MIN and MAX value for the range. Any values outside of the MIN/MAX are invalid.
Validity | Codified data | Invalid values include: [List of invalid values] | Specifies a list of values that are not valid. All other values are valid.
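The four rule patterns in the table lend themselves to a uniform evaluation function. The sketch below assumes a dictionary representation of a rule; that representation, and the sample value domains in the usage note, are illustrative assumptions, not part of the deck's syntax.

```python
# A sketch implementing the four validity rule patterns from the table.
# The dict-based rule representation is an assumption for the example.

def is_valid(value, rule):
    """Apply one validity rule to a value; return True if the value is valid."""
    kind = rule["type"]
    if kind == "valid_list":    # valid values are limited to a list
        return value in rule["values"]
    if kind == "code_table":    # values must exist in a code table column
        return value in rule["code_table_column"]
    if kind == "range":         # valid values fall between MIN and MAX
        return rule["min"] <= value <= rule["max"]
    if kind == "invalid_list":  # listed values are invalid; all others valid
        return value not in rule["values"]
    raise ValueError(f"unknown rule type: {kind}")
```

As the slide notes, the same rule can drive a transformation: any value for which `is_valid` returns False could be replaced with a consistent default.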
  • 44. Proprietary Using DQ Dimensions to Create Good Measurements Characteristics of Good Measurements • Meaningful: They are focused on characteristics that are important. » Think: taking a child’s temperature • Comprehensible: They present information that people can understand. » Think: understanding how to read a thermometer • Actionable: They allow people to make a decision or take an action. » Think: knowing what to do when the temperature is higher than normal Dimensions help with all of these • Meaningful: They are focused on characteristics that are important – They define what = GOOD Quality and what = BAD Quality • Comprehensible: They present information that people can understand. – They allow people to understand what is RIGHT or WRONG with the data – it is incomplete, invalid, etc. • Actionable: They allow people to make a decision or take an action. – They allow people to decide whether or not the data they want to use is FIT FOR PURPOSE
  • 45. Proprietary Sample Data Quality Standards
Data Element | Completeness | Conformity | Validity | Integrity
Date of Birth | Must be populated | Must conform to the date format requirements of the system in which it is present | Cannot be a future date; person cannot be older than 120 years (based on current date) | All occurrences of records for an individual should have the same date of birth
Birth Gender | Must be populated | See Validity | Values must exist in [list OR code table / column …] | All occurrences of records for an individual should have the same birth gender
Gender Identity | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | All occurrences of records for an individual within a time frame should have the same gender identity
  • 46. Proprietary Sample Data Quality Standards
Data Element | Completeness | Conformity | Validity | Integrity
Ethnicity | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | Optional field -- no integrity rules
Race | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | Optional field -- no integrity rules
Marital Status | Situational -- required for some business processes | See Validity | Values must exist in [list OR code table / column …] | Situational
Relationship to Subscriber | Must be populated | See Validity | Values must exist in [list OR code table / column …] | No rules identified; value can change over time
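The Date of Birth validity rule in the standards above (not a future date, age not over 120 years) is concrete enough to code directly. The 120-year limit comes from the slide; the date handling details below are an assumption for the example.

```python
from datetime import date

# A sketch of the Date of Birth validity rule from the sample standards:
# the date cannot be in the future, and the implied age cannot exceed
# 120 years relative to the current date.

def date_of_birth_is_valid(dob, today=None):
    """Return True if dob is not in the future and implies age <= 120."""
    today = today or date.today()
    if dob > today:
        return False  # future date: invalid
    try:
        earliest = today.replace(year=today.year - 120)
    except ValueError:  # today is Feb 29 and the target year is not a leap year
        earliest = today.replace(year=today.year - 120, day=28)
    return dob >= earliest
```

This is a reasonability check, not a correctness check: a date can pass both rules and still be wrong for the individual, which is why the standards pair validity with the cross-record integrity rule.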
  • 47. Proprietary Applying DQ Dimensions – Lessons Learned • The dimensions provided a new perspective on the data. • Seeing the data separate from and in relation to systems – What is optional in a system may be mandatory to a downstream process • Translating common sense expectations about the data into consistent, objective criteria for reasonability – Every person has a birth date and we can define a reasonable range for birth dates within a database • Seeing gaps in expectations – Marital status -- should everyone have a marital status, even a child? Or should only subscribers have this? • Seeing that some concepts are not well-defined and may never be well-defined – Race, ethnicity • Some concepts that we once considered well defined are evolving – Gender identity vs. birth gender Despite the flux, we were able to come to consensus on our expectations – the dimensions provided a vocabulary to do so. They allowed us to clarify expectations in a consistent manner.
  • 48. Proprietary Example Measurement This measures the level of completeness [MUST BE POPULATED] of a critical field. It is very simple, because the concept itself is very simple. In many cases, you don’t need more than this. [Trend chart: daily completeness values for the field, 27 Dec 2018 through 7 Feb 2019]
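The measurement behind a trend chart like this is just the share of records with the critical field populated, computed per load date. The record layout and field name in the sketch below are assumptions for the example.

```python
# A sketch of a daily completeness measurement: for each load date,
# the fraction of records in which a critical field is populated.
# The record structure is assumed for the example.

def completeness_by_day(records, field):
    """Return {load_date: fraction of records with `field` populated}."""
    totals, populated = {}, {}
    for rec in records:
        day = rec["load_date"]
        totals[day] = totals.get(day, 0) + 1
        value = rec.get(field)
        if value is not None and str(value).strip() != "":
            populated[day] = populated.get(day, 0) + 1
    return {day: populated.get(day, 0) / totals[day] for day in totals}
```

Plotting the resulting series over time produces exactly the kind of trend line the slide describes, and a sudden drop in the series is the signal monitoring looks for.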
  • 49. Proprietary Example of Rolled-Up Scorecard with Visual MAY 2019 Summary Scorecard – Dimensions of Quality
Line of Business | Source System | Completeness | Conformity | Validity | Overall Score | Upper Threshold | Lower Threshold | Status
ABC | ABC1 | 97% | 93% | 100% | 96.5% | 98.5% | 95.0% | Amber
ABC | ABC2 | 97% | 94% | 100% | 96.7% | 98.5% | 95.0% | Amber
DEF | DEF1 | 96% | 93% | 100% | 96.1% | 98.5% | 95.0% | Amber
DEF | DEF2 | 97% | 91% | 100% | 96.1% | 98.5% | 95.0% | Amber
GHI | GHI1 | 99% | 98% | 100% | 98.7% | 98.5% | 95.0% | Green
JKL | JKL1 | 96% | 98% | 100% | 98.0% | 98.5% | 95.0% | Amber
MNO | MNO1 | 97% | 93% | 100% | 96.7% | 98.5% | 95.0% | Amber
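A rollup like this combines dimension scores into an overall score and assigns a status against thresholds. The sketch below uses a simple equal-weighted average and the 98.5% / 95.0% thresholds from the scorecard; the equal weighting is an assumption for the example, since the deck's actual weighting scheme is not shown.

```python
# A sketch of a scorecard rollup: average the dimension scores, then
# assign Green/Amber/Red against upper and lower thresholds. The equal
# weighting is an assumption; the thresholds mirror the example scorecard.

def scorecard_row(scores, upper=0.985, lower=0.95):
    """scores: {dimension: score in 0..1}. Returns (overall, status)."""
    overall = sum(scores.values()) / len(scores)
    if overall >= upper:
        status = "Green"
    elif overall >= lower:
        status = "Amber"
    else:
        status = "Red"
    return round(overall, 3), status
```

In practice the weighting usually reflects which dimensions matter most to the data consumers, which is one more place where SME input feeds the measurement design.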
  • 50. Proprietary [Line chart: Overall Data Quality by source system (ABC1, ABC2, DEF1, DEF2, GHI1, JKL1, MNO1), Nov-18 through May-19; y-axis 90.0%–100.0%]
  • 51. Limitations of “Data as a Product” OR Why Data is Different
  • 52. Proprietary Limitations of the Product Metaphor The Challenges • Data is not a physical product. • Data is not tangible, but it is durable. • Data is easy to copy but very hard to reproduce from scratch. • The same data can be used by multiple people and processes at the same time. • Data is volatile. • The value of data changes based on context and timing. • Using data often results in new data. Adapted from DMBOK2, chapter 1, which is adapted from Redman, Data Driven. The Risks • An organization does not know what data it has. • Data can be lost, breached, or misused. • Data is replicated and variation is created between data sets. • The quality of data deteriorates over time or across functions. • Knowledge of data deteriorates within the organization. NOT JUST FITNESS FOR PURPOSE • Representational effectiveness: How well and consistently data represents the concepts it is intended to represent. • Data Knowledge: How easily and well data consumers can “decode” data.
  • 53. Proprietary Power, Process, Technology, and Standardization enabled Vertical Integration of Manufacturing
  • 54. Proprietary We don’t treat data as a product • Production: Data comes from many places; very little control over the inputs • Inventory: Organizations do not know what data they have, what condition it is in, what relation it has to the processes that created it, etc. • Storage: The ways that we store data have an impact on its quality, but we do not always account for this when we work with data. • Usage: Organizations do not know how their data will be used, which calls assumptions about its quality into question. We do not recognize the connection between data production and data uses. What would happen if we treated physical products the way that we treat data?
  • 55. Proprietary What is the Product? Who is the customer? The challenges with the product concept are related to how data evolves within the data lifecycle and to different levels of awareness of data as a product along the data chain: • Evolution: Data has many uses, and these uses change over time. – Example: Mail order companies once wanted your address simply to ship you a product. Now they want it to understand customer demographic patterns. • Evolution: Once people start using data, they want to refine data. – Example: Transition from ICD-9 to ICD-10 represents a refinement of diagnosis codes • Data Chain: Data that meets its initial quality criteria may not be of high quality for downstream uses. – Example: Data may be good enough to enable a claim to be adjudicated, but not good enough to do outreach to a member • Data Chain: Many upstream processes are not aware of the downstream uses of data. – A field that is not required for Provider Demographics may be required to assess quality of care
  • 56. Proprietary DQ Challenges stem from the nature of data • The semiotic challenge: People have different ways of representing the “same” concepts. Data is disparate. • The knowledge challenge: Because data is complicated, a single individual cannot know all the data. • The technical challenge: Different technical approaches to creating and using data influence the data itself and impact its quality. • The political challenge: Data is knowledge, knowledge is power, power is political. The simple answer: Define, Measure, Monitor • Define: To reduce ambiguity • Measure: To confirm the actual state of the data • Monitor: To detect change over time The more complicated answer: these things are hard to do.
  • 58. Proprietary The Semiotic Challenge: Reality and Data Source: Measuring Data Quality for Ongoing Improvement. By Laura Sebastian-Coleman (Morgan Kaufmann, 2013)
  • 59. Proprietary The Semiotic Challenge: Data and Reality Source: Measuring Data Quality for Ongoing Improvement. By Laura Sebastian-Coleman (Morgan Kaufmann, 2013)
  • 60. Proprietary Data Quality as a Technical Challenge Different technical approaches to creating and using data influence the data itself. Example: SAS vs. Hadoop rounding difference (descriptions and account numbers modified for this example) Problem: When data is extracted in Hadoop and SAS using the same query, there is a difference in the number of records extracted (7,192 records for the Jan-18 period). Observation: All of the missing records have YTD_Actual less than 50 cents in absolute value. Hypothesis: Hadoop rounds differently than SAS, so records whose values rounded to '0' in the YTD_ACTUAL field were excluded from the query.
PERIOD_KEY | MAJ_ACCT_DESC | MIN_ACCT_DESC | YTD_ACTUAL
Jan-18 | Account 1 | Health Management, LLC | 0.01
Jan-18 | Account 2 | MEDICARE | -0.01
Jan-18 | PREPAID EXP | Commissions-HMO Based Product | -0.02
Jan-18 | CURR & DEF'D TAXES | CUR INC TAXES - STATE | 0.29
Jan-18 | MISC LIAB | Settlements | -0.1
Jan-18 | EXP - GEN'L | Telecom Comm Equip:Owned | -0.39
Jan-18 | EXP - GEN'L | Phone - Local/Long Distance | 0.13
Jan-18 | EXP - GEN'L | Phone - Local/Long Distance | 0.33
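The failure mode in the hypothesis can be demonstrated in miniature: if one platform rounds the value before applying a "non-zero" filter and the other filters on the raw value, every record under half a cent in absolute value disappears from one extract. The filter logic and sample values below reconstruct the hypothesis for illustration; they are not the actual queries involved.

```python
# A sketch of the rounding-before-filtering failure mode described above.
# The non-zero filter is an assumption reconstructing the hypothesis.

def extract_unrounded(rows):
    """Keep rows where the raw YTD_ACTUAL value is non-zero."""
    return [r for r in rows if r["YTD_ACTUAL"] != 0]

def extract_rounded(rows):
    """Keep rows where YTD_ACTUAL rounded to whole units is non-zero."""
    return [r for r in rows if round(r["YTD_ACTUAL"]) != 0]

# Sample values echoing the table: everything under |0.50| rounds to 0
rows = [{"YTD_ACTUAL": v} for v in (0.01, -0.02, 0.29, -0.39, 125.00)]
missing = len(extract_unrounded(rows)) - len(extract_rounded(rows))
```

The lesson generalizes: platform-level differences in numeric handling (rounding modes, float vs. decimal types) are a technical source of DQ discrepancies even when the query text is identical.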
  • 61. Proprietary Data Quality: The Knowledge Challenge The knowledge challenge: In any organization, data is more complicated than a single person can comprehend. Because data is complicated, it cannot be managed without metadata (documented knowledge about the data). The challenge goes beyond knowledge of the data to knowledge of how to manage data quality. It includes: • Unexplored assumptions about data and data management – some of which we covered in the semiotic challenge. • Lack of consensus about the meaning of key concepts (Data, Data Quality, Data Quality measurement) – which is why I started with definitions. • Lack of clear goals and deliverables for the data assessment process. • Lack of a methodology for defining “requirements”, “expectations” and other criteria for the quality of data at the level needed for measurement. These criteria are necessary for measurement.
  • 62. Proprietary Data Quality as a Political Challenge The political challenge: Data is knowledge, knowledge is power, power is political. Etymology of Politics: • Poli = many • Ticks = blood-sucking vermin Most people dislike politics. People do not always mean to be political about data. But data represents business processes, so people are protective. Their data may be high quality OR it may be low quality OR they may not know. Data is about knowledge and people like to be knowledgeable. No one likes to feel “un-knowledgeable” (i.e., dumb).
  • 63. Overcoming the Knowledge Challenge Through Profiling and Data Inspection ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 63
  • 64. Proprietary Definition: Data Quality & Data Quality Management Data Quality: A measure of the degree to which data is fit for the purposes of the people and systems that use the data. Data Quality Management: A set of activities intended to ensure that data is fit for purpose, including: • Data quality assessment • Data quality requirements definition • Data quality monitoring • Data issue detection • Issue remediation • Reporting on data quality • Improving business and technical processes to ensure data is of high quality All data management processes have the potential to impact the fitness of data for use, but not every data management process needs to be called a “data quality” process. Assessment is the activity of running the profiling engine and looking at the results (data profiling and analysis); analysis of profiling results supports the other activities.
  • 65. Proprietary Definition: Data Profiling • Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing. • Data quality assessment is the process of evaluating data to identify errors and understand their implications (Maydanchik, 2007). • Data profiling is a specific kind of data analysis used to discover and characterize important features of data fields and data sets, including: – Data types – Field lengths – Cardinality of columns – Granularity – Existing values – Format patterns – Content patterns – Implied rules – Cross-column and cross-file data relationships
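A minimal profiler covering a few of the features in the list above (cardinality, existing values, field lengths, format patterns) can be sketched in a few lines. Real profiling engines do far more; the list-of-strings input and the digit/letter pattern encoding below are assumptions for the example.

```python
from collections import Counter
import re

# A minimal column profiler: counts, cardinality, lengths, and format
# patterns (digits -> 9, letters -> A). Input is assumed to be a list of
# strings (or None for missing values).

def profile_column(values):
    """Summarize one column of string values."""
    non_null = [v for v in values if v is not None and v != ""]
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "populated": len(non_null),
        "cardinality": len(set(non_null)),
        "min_length": min((len(v) for v in non_null), default=0),
        "max_length": max((len(v) for v in non_null), default=0),
        "top_patterns": patterns.most_common(3),
    }
```

Even this much output surfaces the findings categories discussed later: a column that is 100% defaulted, a format pattern that differs from the documented one, or unexpected cardinality all show up directly in these numbers.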
  • 67. Proprietary Profiling Goals – Overcoming the Knowledge Challenge in Projects • Reduce risks related to data development • Enable initial assessment of source-supplied metadata to reduce the risk of errors related to incorrect identification of data fields • Identify risks and obstacles to use of sources (data issues, incorrect assumptions, differences in data granularity, naming conventions, etc.) • Accurately identify encryption requirements • Identify critical data for ongoing data quality measurement, monitoring, and reporting • Improve project process efficiency • Improve the quality and consistency of system metadata, beginning with table and column definitions • Provide input to mapping, including conformance • Provide input to data modeling • Provide input to ETL design, including system controls • Provide input for Quality Assurance and User Acceptance Testing • Enable Governance over time • Data quality monitoring • Improved metadata
  • 68. Proprietary DART – Data Analysis Results Template The DART’s worksheets break down into four groups: • Reference information: Five tabs describe the template and provide guidance on how to observe data characteristics in profiling results (TEMPLATE PURPOSE AND OVERVIEW; GUIDELINES and USAGE NOTES; DQ CHECK LIST; FIELD DEFINITIONS; DOWNLOADS) • Project information: Two tabs bookend the process, one for project goals, the other for summarized findings and action items (PROJECT DETAILS; SIGN OFFS and ACTION ITEMS) • Findings and analysis: Four tabs that make up the core of the template and allow analysts to consistently document what they see (CONTEXT and METADATA RESULTS; RESULTS TABLE LEVEL; RESULTS COLUMN LEVEL; and the DQ Analysis tab, which at this time will not be used by projects) • DQ specification: Captures details for DQ measurements; input for this comes from the findings and analysis tabs (DQ MEASUREMENT SPECIFICATION)
  • 70. Proprietary Example Overall Findings
FINDING CATEGORY | COUNT | PERCENTAGE
No Data -- 100% defaulted | 85 | 35%
Data appears as expected | 64 | 26%
Technical field | 21 | 9%
Data Differs from Metadata | 19 | 8%
Questionable Values | 17 | 7%
Should be encrypted, is not | 10 | 4%
Sparse Data -- 99% Defaulted | 9 | 4%
Questionable Population | 7 | 3%
Conformance Risk | 6 | 2%
Questionable Population and Values | 4 | 2%
TOTAL | 242 | 100%
[Bar chart: counts by finding category]
  • 71. Proprietary Making findings actionable
FINDING CATEGORY | COUNT | PERCENTAGE | ACTION
Data appears as expected | 64 | 26% | No action
Technical field | 21 | 9% | No action
No Data -- 100% defaulted | 85 | 35% | Determine impact to project
Data Differs from Metadata | 19 | 8% | Clarify with source system; update metadata
Questionable Values | 17 | 7% | Clarify with source system; update metadata
Sparse Data -- 99% Defaulted | 9 | 4% | Clarify with source system; update metadata
Questionable Population | 7 | 3% | Clarify with source system; update metadata
Questionable Population and Values | 4 | 2% | Clarify with source system; update metadata
Conformance Risk | 6 | 2% | Inform BAs and Modelers
Should be encrypted, is not | 10 | 4% | Revise encryption requirements for file
TOTAL | 242 | 100% |
[Bar chart: counts by finding category, grouped by action: no action required; determine if there is impact; clarify with source system; manage within the workstream (inform BAs and Modelers, update requirements)]
  • 72. Big Data Challenges Because it is not getting simpler… ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 72
  • 73. Proprietary Some people are optimistic about Big Data Often, big data is messy, varies in quality .... What we lose in accuracy at the micro level we gain in insight at the macro level. Viktor Mayer-Schonberger and Kenneth Cukier, Big Data: A revolution that will transform how we live, work, and think. A data lake's data quality practices are less about the syntactic quality of the data (are all the fields perfect?) and more about the semantic quality of the data (can we use this well?). John Myers, "How to answer the top three objections to a data lake." Info World. September 6, 2016 People who object to data lakes are only defending the care, feeding, and maintenance of a data warehouse. The types of 'needs' that this objection is attempting to address are data governance, quality, stewardship, and lineage. John Myers, "How to answer the top three objections to a data lake." Info World. September 6, 2016
  • 74. Proprietary And other people are not We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But they just lose track of what's there. Sean Martin, Cambridge Semantics Many companies are guilty of dumping data into the data lake without a strategy for keeping track of what's being ingested. This leads to a murky, swampy repository .... Unlike relational databases, Hadoop is little help when it comes to quality control. Tony Fisher, Validating Data in the Data Lake. Zaloni Blog. December 15, 2016. Without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place. Gartner, "Gartner says Beware the Data Lake Fallacy" Some data lake initiatives have not succeeded, producing instead more silos or empty sandboxes. Brian Stein and Alan Morrison, "The enterprise data lake: Better integration and deeper analytics." PWC Technology Forecast: Rethinking Integration. Issue 1, 2014.
  • 75. Proprietary Big Data – Definition The DMBOK2 points out: • The term Big Data is associated with technological changes that have enabled people “to generate, store, and analyze larger and larger amounts of data.” • People use this data “to predict and influence behavior, as well as gain insight on a range of important subjects, such as health care practices, natural resource management, and economic development”. And shopping. Big Data goes hand-in-hand with data science: • Changes in technology not only enable collection of huge amounts of data, they also enable analysis of it. • Data Science includes the creation of models that enable understanding of possible outcomes, if variables change.
  • 76. Proprietary Data as the Product of a Process DQ problems are usually detected in Data Output But those problems can be caused at any point in the production or consumption process Data Quality is understood in terms of outputs • Expected outputs = Good Quality Data • Unexpected outputs = Poor Quality Data
  • 77. Proprietary Big Data Production Process has Different Risks Still a relationship between inputs, steps, and outputs, but more risk in the process. Risk can be reduced through knowledge of the original production processes for the data.
  • 78. Proprietary Life Cycle Management The same questions apply to Big Data as apply to traditional data. The same connections exist between data quality, metadata and data governance.
  • 79. Proprietary Big Data and the Product Metaphor Big Data • Volume • Variety • Velocity • Veracity Creating more data, of different kinds, more quickly. Different types of data have different degrees of structure depending on how they are produced. • Production: Data comes from many places; very little control over the inputs • Inventory: Organizations do not know what data they have, what condition it is in, what relation it has to the processes that created it, etc. • Storage: The ways that we store data have an impact on its quality • Usage: Organizations do not know how their data will be used, which calls assumptions about its quality into question. We do not recognize the connection between data production and data uses. • For Big Data, these problems are intensified.
  • 80. Proprietary Volume & Velocity: Impact on Veracity The Big Data characteristics of volume and velocity have an effect on how veracity (truth, and from there, quality) can even be defined.
Type of Data | Volume | Velocity | Veracity
Mainframe | Large but predictable | Fast but predictable | Measurable
Tabular | Large but predictable | Fast but predictable | Measurable
Machine Generated | Potentially huge | Super fast | Calibration
Unstructured | Potentially huge | As fast as people can produce it | What would this even mean?
  • 81. Proprietary Variety We associate Big Data with new kinds of data, but a lot of traditional data is also being stored in data lakes. Big Data is often referred to as “unstructured”, but it contains a lot of semi-structured data and also contains forms of data that are inherently structured by virtue of how they are collected.
Category | Type of Data | Example | Inherent Structure
Trad | Mainframe | EBCDIC files | High, but messy
Trad | Tabular | Warehouse tables | High
Big | Machine Generated | Sensor data | Very high
Big | Unstructured | Twitter | Low
  • 82. Proprietary Logical Relationship between DQ Dimensions • COMPLETENESS: You have all the pieces of data you need or expect to have. If you do not have all the data you need, then other measurements of quality may not even matter. • FORMAT CONFORMITY: Data is in the form you expect it to be in. If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect. • VALIDITY: Data values belong to the set of possible (expected) values. If the data values are not in the allowed set of values, then they cannot be correct. • INTEGRITY: Different pieces of data relate to each other in the ways you expect them to. If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way you intended to. • CONSISTENCY: Data follows patterns that you expect it to follow. If the data does not follow expected patterns, then you will want to understand why (Change in the pattern or error in the data?)
  • 83. Proprietary Big Data and Dimensions of Quality Dimensions of quality provide a means to think about how to approach quality for big data.
Type of Data | Completeness | Format Conformity | Validity | Integrity | Consistency
Mainframe | Number of records generated per time period | Constrained by rules | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Tabular | Number of records generated per time period | Constrained by rules | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Machine Generated | Rate at which data is collected | Constrained by collection device | Based on calibration of collection device | Depends on consistent collection devices | Expectation based on the process the data represents
Unstructured | ?? | Not relevant | Not applicable | ?? | No expectation of consistency
• 84. Proprietary Data Quality Challenges – Intensified by Big Data
• The semiotic challenge: People have different ways of representing the "same" concepts.
  – Traditional: GOVERNANCE
  – Big: Governance, but at the category and metadata level
• The technical challenge: Different technical approaches to creating and using data influence the data itself.
  – Traditional: DATA STANDARDS
  – Big: Manage the ingest process; especially, manage metadata up front
• The knowledge challenge: Because data is complicated, a single individual cannot know all the data.
  – Traditional: METADATA
  – Big: Metadata is even more important
• The political challenge: Data is knowledge, knowledge is power, and power is political.
  – Traditional: GOVERNANCE / CULTURE
  – Big: Governance / Culture
  • 85. Meeting Big Data Challenges Let’s do this thing! ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 85
• 86. Proprietary Meeting the Challenges for Big Data
METADATA – Addressing the knowledge challenge
• Production and lineage: Data comes from many places, with very little control over the inputs. Need to know where data comes from.
• Inventory: Inventory how much data an organization has and what data it has.
• Storage: Need to understand how the ingest and storage process impacts data.
• Usage: We will never know all the potential uses of data. Ensure consumers know what the data represents, how it was produced, and how it is stored.
• Metadata management: Enable data usage by managing knowledge of data; set minimum requirements for metadata related to big data. The priorities change: it is not possible to define every field the way you would with traditional data.
GOVERNANCE – Addressing data, process, and cultural risks
• Accountability: Defining data ownership and accountability
• Protection: Protecting against the misuse of data
• Risk mitigation: Managing risks associated with data
• Standards: Defining and enforcing standards for data quality
• 87. Proprietary Summary
Product management practices do work for traditional data. They also work for Big Data, but with modifications based on the production processes of Big Data.
Managing the quality of both Big Data and traditional (little) data depends on managing metadata.
The process of figuring out how to manage your data will significantly inform what you need to do to govern your data via standards and monitoring.
• 88. Proprietary Meeting the Challenges with Big and Little Data: Characteristics of a Trusted Source of Data
1. SECURE: Data is protected against inappropriate access or use through policies, processes, and tools.
2. RELIABLE: Data processing is predictable and reliable. The system is monitored for performance. Controls are in place to detect and respond to unexpected events.
3. DATA QUALITY IS KNOWN: The criteria for high-quality data are defined. Levels of quality are measured and reported on. Data issues are communicated to data consumers and remediated based on business priorities.
4. TRANSPARENT AND COMPREHENSIBLE: Data consumers have the information (metadata) they need to understand and get value from the data. Knowledge about the system and its data is documented, accessible, usable, and current.
5. SUPPORTED: A dedicated production support team is in place and has the processes and protocols it needs to respond in a timely manner to questions and issues related to the operation of the system and the data in the system.
6. COMMUNICATED: New data consumers have access to relevant training; existing data consumers are informed of changes that impact their uses of the data.
7. GOVERNED: Processes and accountabilities are in place to make decisions about the data in the system.
• 89. Proprietary Goals from Agenda
Agenda
• Introductions
• Quality management concepts and principles
• Applying quality management to traditional data
• Big Data challenges
• Data Quality Practices for Big Data and Little Data
  • 90. Thank you! Laura Sebastian-Coleman Sebastian-ColemanL@Aetna.com ©2018 CVS Health and/or one of its affiliates: Confidential & Proprietary 90
• 91. Proprietary References
DAMA International. DAMA Data Management Body of Knowledge. 2nd edition. Technics Publications, 2017.
The Data Governance Institute: http://www.datagovernance.com/adg_data_governance_definition/
English, Larry. Improving Data Warehouse and Business Information Quality. John Wiley & Sons, 1999.
Fisher, Tony. "Validating Data in the Data Lake." Zaloni Blog, December 15, 2016.
Gartner. "Gartner Says Beware of the Data Lake Fallacy."
Gartner. "How to Create a Business Case for Data Quality Improvement." https://www.gartner.com/smarterwithgartner/how-to-create-a-business-case-for-data-quality-improvement/
Klein, Anja, Hong-Hai Do, Marcel Karnstedt, Wolfgang Lehner, and Gregor Hackenbroich. "Representing Data Quality for Streaming and Static Data." https://www.researchgate.net/publication/4297383_Representing_Data_Quality_for_Streaming_and_Static_Data
Kumar, Vinu. "Solving Data Quality in Streaming Data Flows." https://streamsets.com/blog/solving-data-quality-streaming-data-flows/
Mayer-Schonberger, Viktor, and Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think.
McGilvray, Danette. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Morgan Kaufmann, 2008.
Myers, John. "How to Answer the Top Three Objections to a Data Lake." InfoWorld, September 6, 2016.
Redman, Thomas. "Bad Data Costs the U.S. $3 Trillion per Year." Harvard Business Review. https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
Sebastian-Coleman, Laura. Measuring Data Quality for Ongoing Improvement. Morgan Kaufmann, 2013. Appendices: https://booksite.elsevier.com/9780123970336/downloads/Sebastian-Coleman_Appendix%20E.pdf
Soares, Sunil. https://www.dataversity.net/big-data-governance-over-streaming-data/#
TechTarget: https://searchbusinessanalytics.techtarget.com/definition/unstructured-data
VideoCoin. "The 5 Most Important Metrics to Measure the Performance of Video Streaming." https://medium.com/videocoin/the-5-most-important-metrics-to-measure-the-performance-of-video-streaming-ab41f4eb9d99
Warner, James. "Innovative, Unheard of Use Cases of Streaming Analytics." https://internetofthingsagenda.techtarget.com/blog/IoT-Agenda/Innovative-unheard-of-use-cases-of-streaming-analytics
  • 92. Big Data Appendix ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary 92
  • 93. Proprietary Data Quality Assessment and Monitoring Overview 93 ©2019 CVS Health and/or one of its affiliates: Confidential & Proprietary Source: Sebastian-Coleman. Measuring Data Quality for Ongoing Improvement. Morgan Kaufmann, 2013.
• 94. Proprietary Approaches in Traditional and Big Data Environments
TRADITIONAL INTEGRATED DATA WAREHOUSE
• Semiotic: Standardize data and definitions via a data model. Everyone will love it.
• Knowledge: Rely on business SMEs for input. They know everything. Information architects will capture knowledge in the data model.
• Technical: Adopt a single technology to execute ETL and integrate data. All data comes through one route and is standardized via that route. If possible, adopt a single BI tool.
• Political: Reassure everyone that the data they are used to will be "the same."
BIG DATA ENVIRONMENT (Lake, Fabric)
• Semiotic: Assume data from different sources will fit together. If it doesn't, people will figure it out; they are data scientists, after all.
• Knowledge: Ask for a data dictionary, but don't worry if you don't get it. Assume that the people requesting the data know what the data represents, so other people probably will, too.
• Technical: Allow multiple technologies for integration and analysis. Hope that people get the same answers from the same data, even though the tools work in totally different ways.
• Political: Reassure everyone that the data is correct because "it is what the source provided."
• 98. Proprietary Define and Measure: Traditional Data
Define:
• The data itself: know / document what the data represents
• Expectations / quality characteristics for the data
• Standards
• Rules
Measure: actual data against expectations.
Completeness rule: Column is mandatory. It must be populated.
Validity rule: Valid values include X, Y, Z. All other values are invalid.
Measurement: # of records populated with a valid value / total # of records = percentage of records that meet the quality rule.
This is a very simple example, but the idea can be extended from columns, to files, to data domains.
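The measurement above can be sketched in a few lines. The column name, valid-value set, and sample records below are hypothetical, chosen only to show the pattern of counting passing records against a total.

```python
# Validity rule: valid values include X, Y, Z; all other values are invalid.
# The completeness rule (column is mandatory) is covered implicitly, since
# an empty value is never in the valid set.
VALID_VALUES = {"X", "Y", "Z"}

records = [
    {"status": "X"},
    {"status": "Y"},
    {"status": ""},   # fails completeness: mandatory column not populated
    {"status": "Q"},  # populated, but not a valid value
    {"status": "Z"},
]

passing = sum(1 for r in records if r["status"] in VALID_VALUES)
pct = 100.0 * passing / len(records)
print(f"{pct:.1f}% of records meet the quality rule")  # 60.0%
```

The same shape of calculation scales up: the rule changes (column, file, or domain level), but the measure stays "records meeting the rule divided by the denominator."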
• 99. Proprietary Why challenges are intensified with Big Data: The V's
Variety: Measurement of quality depends on data's inherent structure. Volume and Velocity affect how Veracity (truth, and from there, quality) can be defined.

VARIETY
Type of Data | Example | Inherent Structure | Structure driven by
Trad Mainframe | EBCDIC files | High, but messy | Design of the originating system
Trad Relational DB | Warehouse tables | High | Data model
Big Machine Generated | Streaming sensor data | Very high | Design of the collection device
Big Unstructured | Twitter | Low | Application interface, language of user

VOLUME, VELOCITY, VERACITY
Type of Data | Volume | Velocity | Veracity
Trad Mainframe | Large but predictable | Potentially fast but predictable | Measurable (compare to real world or other data)
Trad Relational DB | Large but predictable | Potentially fast but predictable | Measurable (compare to real world or other data)
Big Machine Generated | Potentially huge | Super fast, variable | Dependent on calibration of instrument / collection device
Big Unstructured | Potentially huge | Super fast, variable | What would this even mean?
• 100. Proprietary Why challenges are intensified with Big Data
Dimensions of quality provide a way to think about data quality measurement for Big Data.

Type of Data | Completeness | Format Conformity | Validity | Integrity | Consistency
Trad Mainframe | Number of records generated / time period; comparison to a known real-world population | Constrained by system rules | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Trad Relational DB | Number of records generated / time period; comparison to a known real-world population | Constrained by model; can be systematically constrained | Constrained by rules | Can be systematically constrained | Expectation based on the process the data represents
Big Machine Generated | Rate at which data is collected | Constrained by collection device | Based on calibration of collection device | Depends on consistent collection devices | Expectation based on the process the data represents
Big Unstructured | No general definition of "completeness" | Not applicable | Not applicable | Not applicable | Not applicable

Machine generated: Quality depends on the machines that collect the data.
Unstructured: Quality depends on having adequate metadata describing individual data sets.
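One practical use of a table like this is to drive which dimension checks are even attempted for a given type of data. The sketch below encodes that idea as a simple lookup; the type names and dimension labels are illustrative, not a real library API.

```python
# Which quality dimensions apply per data type, mirroring the table above.
APPLICABLE_DIMENSIONS = {
    "trad_mainframe":    {"completeness", "format", "validity", "integrity", "consistency"},
    "trad_relational":   {"completeness", "format", "validity", "integrity", "consistency"},
    "machine_generated": {"completeness", "format", "validity", "integrity", "consistency"},
    "unstructured":      set(),  # per the table, the standard dimensions do not apply
}

def checks_to_run(data_type, requested):
    """Keep only the requested checks that make sense for this data type."""
    return sorted(requested & APPLICABLE_DIMENSIONS[data_type])

print(checks_to_run("trad_relational", {"validity", "integrity"}))  # ['integrity', 'validity']
print(checks_to_run("unstructured", {"validity"}))                  # []
```

For unstructured data, the empty set makes the table's point concrete: quality work shifts from dimension checks on content to managing the metadata that describes each data set.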
• 101. Proprietary Approaches to DQ Measurement for Streaming Data

Streaming Video
• Data quality characteristics = product characteristics: color, speed, image resolution, sound / image synchronization.
• Risks: the initial data could be corrupted; the delivery system does not deliver as expected.
• Because the biggest risk is interference with the delivery system, quality = signal-to-noise ratio.

Streaming Sensor Data
• Data quality characteristics are defined by calibrating the collection device.
• Risks: devices calibrated inconsistently; interference with a device; alignment between data from related sensor streams.
• Quality assessed through: metadata related to the conditions of data collection; monitoring the temporal aspect of delivery; patterns in data content.

"Traditional" Data Content Streamed
• Data quality characteristics are similar to those in traditional data (e.g., field = mandatory or optional; criteria are defined for validity; the "same" field has the "same" content).
• Risks: data collected incorrectly; data lost in the process of delivery.
• Measurement done instream: compare incoming data to existing data (e.g., reference or master data; content of existing "records").
• Quality = the level of exceptions. The challenge is establishing the denominator for any measurement (likely a timeframe, rather than a population of records).
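The instream approach for "traditional" content streamed can be sketched as below: validate each incoming value and report the exception rate over a trailing time window, since a timeframe, not a fixed population of records, serves as the denominator. The valid-value set and window size are hypothetical.

```python
from collections import deque
import time

WINDOW_SECONDS = 60               # hypothetical trailing-window denominator
VALID_VALUES = {"X", "Y", "Z"}    # hypothetical instream validity rule

window = deque()  # (timestamp, passed) pairs currently inside the window

def observe(value, now=None):
    """Validate one incoming value; return the exception rate over the window."""
    now = time.time() if now is None else now
    window.append((now, value in VALID_VALUES))
    # Drop observations that have aged out of the trailing window.
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()
    failures = sum(1 for _, ok in window if not ok)
    return failures / len(window)

rate = observe("X", now=0.0)   # 0.0 — one observation, no exceptions
rate = observe("Q", now=1.0)   # 0.5 — one failure out of two in the window
rate = observe("Y", now=30.0)  # the earlier failure is still inside the 60s window
```

A real implementation would compare against reference or master data rather than a static set, but the denominator problem is the same: the rate is meaningful only relative to the chosen timeframe.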