3. Proprietary
Abstract: DQ Fundamentals
• Organizations today get value from their data in the face of challenging odds.
• Optimal management of traditional data requires a wide skillset and a strategic perspective.
• Changes in technology have increased the volume, velocity, and variety of data, but many
organizations do not yet have a handle on veracity in traditional data management
environments, never mind big data environments.
• And, while big data is on the rise, more traditional forms of data are not going away. Instead,
different kinds of data will co-exist and must be managed in conjunction with one another.
• This tutorial will revisit the fundamentals of data quality management in the light of big data and
explore how to apply them in traditional and big data environments.
• Participants will learn how to assess the current state of their data environment and deliver
more reliable data to their stakeholders.
4. Proprietary
About me
Data quality practitioner in the health care industry since 2003
Background in banking, manufacturing / distribution, commercial insurance, and academia
Publications
– Author, Navigating the Labyrinth: An Executive Guide to Data Management (2019)
– Production Editor, DAMA Data Management Body Of Knowledge second edition, [DMBOK2] (2017)
– Author, Measuring Data Quality for Ongoing Improvement (2013)
Service
– Advisor, DAMA New England, 2019 - present
– DAMA Publications Officer, 2015 – 2019
– IAIDQ (now IQ International) Member Director, 2010-12
Recognition
– DAMA International Recognition for Outstanding Contributions to Data Management, 2019
– DAMA New England Award for Excellence in Data Management, 2019
– IAIDQ (now IQ International) Distinguished Member Award, 2015
6. Proprietary
Abstract and Agenda
Abstract:
Organizations today get value from their data in the face of
challenging odds. Optimal management of traditional data
requires a wide skillset and strategic perspective.
Changes in technology have increased the volume, velocity, and
variety of data, but many organizations do not yet have a handle
on veracity in traditional data management environments, never
mind big data environments. And, while big data is on the rise,
more traditional forms of data are not going away. Instead,
different kinds of data will co-exist and must be managed in
conjunction with one another.
This tutorial will revisit the fundamentals of data quality
management in the light of big data and explore how to apply
them in traditional and big data environments. Participants will
learn how to assess the current state of their data environment
and deliver more reliable data to their stakeholders.
Agenda
• Introductions
• Quality management concepts and principles
• Applying quality management to traditional data
• The role of measurement and monitoring
• Big Data challenges
• Data Quality Practices for Big Data and Little Data
8. Proprietary
Why DQ Management Matters: Poor quality data cost money
• Reports differ, but many estimate that between 10% and 30% of productivity is lost due to poor
quality data.
• This estimate may even be low: one report indicated that data scientists spend 60% of their time
cleansing data.
• IBM estimated that data quality problems cost the US $3 Trillion in 2016.
[Pie chart: 10-30% of productivity is lost due to poor quality data (30% unproductive vs. 70% productive time)]
[Pie chart: data scientists' time: 60% spent cleansing data vs. 40% spent analyzing data]
10. Proprietary
Definition of Quality: Fitness for Purpose / Fitness for Use
Data Quality: A measure of the degree to which data is fit for
the purposes of the people, processes, and systems that use
the data.
The concept of “fit for purpose” directly relates data quality to
the quality of manufactured products.
Data is a product, NOT a by-product.
“Fit for Purpose” also relates data quality to the concept of a data
consumer – a person, process, or a system that uses data.
Data Quality Management: A set of activities intended to
ensure that data is fit for use by data consumers.
11. Proprietary
Manufacturing: A brief history of mass-produced products
19th Century Industrial Revolution:
• Steam power
• Interchangeable parts
• Development of large factories
• Production line manufacturing processes
• Machine tooling
20th Century Mass Production:
• Machine tooled interchangeable parts
• Assembly line
• Vertical integration of the manufacturing process
• Quality control
13. Proprietary
Pioneers of Quality Control
• Defined criteria for quality based on customer
expectations
• Recognized the relation between a well-defined
process and a predictable outcome
• Used measurement to manage and improve
processes
• Created tools to assess and improve product
quality
• Recognized that producing a quality product
requires life cycle management, supply chain
management, and leadership commitment
14. Proprietary
Quality Control in Manufacturing: Product and Process
A process is a series of steps that turn inputs into outputs.
• The better the quality of the inputs, and the better defined the steps, the better the quality of the outputs.
Add to this the idea that the execution of processes can be improved through observation, analysis, and
feedback.
The more consistent the input and the more consistently the process is executed, the more consistent the
result.
15. Proprietary
Quality and the Customer
Thought leaders in Quality Control / Quality improvement recognize that there is a customer at the end of the
assembly line: Someone wants to buy the product.
That person has expectations at two levels:
• At the very least, the Product must perform its primary function.
• Ideally, the Product also pleases the customer in some way.
Dimensions of Product Quality (from David Garvin)
– Performance: The product operates as expected.
– Features: The product has additional characteristics that please the customer.
– Reliability: The product works well. The customer can count on it.
– Conformance: The product meets standards.
– Durability: The product lasts for an expected amount of time.
– Serviceability: If the product breaks it can be fixed.
– Aesthetics: The product is attractive and pleasing.
– Perceived Quality: The customer feels good about the product.
16. Proprietary
Intention and quality: Quality is not accidental
Source: Kaizen institute of India.
https://kaizeninstituteindia.wordpress.com/2013/10/08/quality-is-
not-an-act-it-is-a-habit/
18. Proprietary
The Role of Measurement in Quality Control
Statistical process control – a means to
measure the consistency of processes
Measurement formalizes expectations
Monitoring ensures unexpected
variation within the system is detected
20. Proprietary
Definition: Data Quality Management
Data Quality: A measure of the degree to which data is fit for the purposes of the people and
systems that use the data.
Data Quality Management: A set of activities intended to ensure that data is fit for purpose,
including:
• Data quality assessment
• Data quality requirements definition
• Data quality monitoring
• Data issue detection
• Issue remediation
• Reporting on data quality
• Improving business and technical processes to ensure data is of high quality
• What you mean by high quality data
• How you detect low quality data
• What you do about low quality data
All data management processes have the potential to impact the fitness of data for use. Not every process needs to be called a "data quality" process.
Core Data Quality processes have foundational, project-oriented, and operational components.
22. Proprietary
Data as the Product of a Process
Process: A process is a series of steps that turn inputs into outputs
23. Proprietary
Data as the Product of a Process
DQ problems are usually detected in data output, but those problems can be caused at any point in the production or consumption process.
Data Quality is understood in terms of outputs
• Expected outputs = Good Quality Data
• Unexpected outputs = Poor Quality Data
25. Proprietary
Intention: Data Quality Improvement via PDCA
The same processes that are applied to improve the
quality of manufactured products can be applied to
improve the quality of data.
Different improvement methodologies use essentially
the same process.
• Six Sigma
• Lean
• Total Quality Management
29. Proprietary
Data Life Cycle Management
Adapted from Danette McGilvray,
Executing Data Quality Projects: Ten
Steps to Quality Data and Trusted
Information
30. Proprietary
Manage Data Quality throughout the Data Life Cycle
Managing quality
throughout the data life
cycle requires
• Data Governance
• Metadata Management
Adapted from Danette McGilvray,
Executing Data Quality Projects: Ten
Steps to Quality Data and Trusted
Information
32. Proprietary
Dimensions of Data Quality – Why they matter
• Data quality dimensions function in the way that length,
width, and height function to express the size of a physical
object.
• They allow understanding of quality in relation to a scale
and in relation to other data measured against the same
scale.
• Data quality dimensions can be used to define
expectations (the standards against which to measure) for
the quality of a desired dataset, as well as to measure the
condition of an existing dataset.
• Dimensions provide an understanding of why we measure
(what question a measurement answers). For example, to
understand the level of completeness, validity, and integrity of
data.
• Dimensions also help us identify things that we cannot
measure or that there is little value in measuring.
34. Proprietary
Dimensions of Data Quality
A Dimension of Data Quality is a characteristic of data that can
be measured and through which data quality can be quantified.
There are many frameworks that define DQ dimensions, and no
single agreed-to set. However, all account for similar concepts,
which have a common-sense meaning.
• COMPLETENESS: You have all the pieces of data you need or expect to
have.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
• VALIDITY: Data values belong to the set of possible (expected) values.
• INTEGRITY: Different pieces of data relate to each other in the ways you
expect them to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
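As a sketch of how these dimensions translate into executable checks, the following applies completeness, conformity, and validity tests to a single record. The field names, date pattern, and state domain are illustrative assumptions, not rules from this tutorial.

```python
import re

# Sketch only: record fields, the date pattern, and the state list are invented.
record = {"member_id": "A123", "birth_date": "1985-07-14", "state": "ZZ"}

VALID_STATES = {"MA", "NH", "VT", "ME", "RI", "CT"}   # assumed value domain
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")     # assumed format rule

def check_record(rec):
    """Return pass/fail findings keyed by (field, dimension)."""
    findings = {}
    # COMPLETENESS: every expected field is populated
    for field in ("member_id", "birth_date", "state"):
        findings[(field, "completeness")] = bool(rec.get(field))
    # FORMAT CONFORMITY: the date is in the form we expect
    findings[("birth_date", "conformity")] = bool(
        DATE_PATTERN.match(rec.get("birth_date", ""))
    )
    # VALIDITY: the value belongs to the set of expected values
    findings[("state", "validity")] = rec.get("state") in VALID_STATES
    return findings

results = check_record(record)
# "ZZ" is populated (complete) but outside the domain (invalid)
```

Note how the same field can pass one dimension and fail another, which is exactly why the dimensions are measured separately.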
36. Proprietary
Logical Relationship between DQ Dimensions
• COMPLETENESS: You have all the pieces of data you need or expect to have.
If you do not have all the data you need, then other measurements of quality may not even matter.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect.
• VALIDITY: Data values belong to the set of possible (expected) values.
If the data values are not in the allowed set of values, then they cannot be correct.
• INTEGRITY: Different pieces of data relate to each other in the ways you expect them to.
If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way
you intended to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
If the data does not follow expected patterns, then you will want to understand why (is it a change in the pattern
or an error in the data?)
37. Proprietary
Using DQ Dimensions to Create Standards & Rules
Standard: A level of quality or attainment (a high standard for customer
service); an idea or thing used as a measure, norm, or model in comparative
evaluations (e.g., ISO standards)
Rule: one of a set of explicit or understood regulations or principles
governing conduct within a specific activity or sphere (e.g., Robert's Rules of
Order); a principle that operates within a particular sphere of knowledge,
describing or prescribing what is possible or allowable.
Business Rule: A business rule is a rule that defines or constrains some aspect
of business and always resolves to either true or false. Business rules are
intended to assert business structure or to control or influence the behavior of
the business.
STANDARDS represent an intersection of Data Quality Management and Data
Governance. Standards are also a form of metadata. DQ uses them to
measure, but they are also simply useful to explain expectations.
38. Proprietary
Benefits of Rules
• Common vocabulary – People understand expectations
in a similar way
• Consensus – People agree to the same things.
• Differences – People can also disagree about rules.
Rules provide a means of surfacing and therefore
clarifying different expectations.
• Simplicity – People make decisions once
• Predictability – People know what is expected of them
and they try to achieve it
• And with data… they give us a way to talk more
objectively about quality
39. Proprietary
Using DQ Dimensions to Create Standards & Rules
Dimensions of quality are the foundation of a common
vocabulary through which to articulate expectations
for quality. They can be used to:
• Create standards and rules & controls for models
and applications
• Establish measurements
• Report problems consistently
For example, completeness can be understood at
several levels: system, data set, record, field.
| Dimension | Data Object | Rule |
| --- | --- | --- |
| Completeness | Data Set | The number of [distinct entity] must be equal to the number of [distinct entity] in [Source] |
| Completeness | Data Set | [Amount field] must reconcile to [Amount field] in [Source] |
| Completeness | Field | Must be populated |
| Completeness | Field | Must be populated; standard default value allowed |
| Completeness | Optional fields, with population rules | Must be populated when … |
| Completeness | Optional fields, with population rules | Must be populated except when …; Must NOT be populated when … |
40. Proprietary
Logic for Field Level Completeness
Dimensions enable you to
consistently describe the
characteristics you are looking
for.
Many rules can be defined
through a logical progression of
questions related to the
dimension.
Here is an example focused on
completeness at the field level.
Similar questions could be asked
at the system or file level.
41. Proprietary
Logic for Format Conformity
As with completeness, with conformity,
we can establish a logical progression
of questions to define expectations at
the field level.
Some fields have one-and-only-one
acceptable format.
More complex fields may have a set of
format requirements.
Others are constrained only by data
type and format may not indicate much
about the quality of data.
42. Proprietary
Logic for Validity
The word validity is used to refer in a
general sense to whether or not the data is
“good”.
As a dimension of quality, it refers to
whether values are part of a defined
domain.
Validity Rules can be based on how the
domain of values is defined.
43. Proprietary
Sample Rule Syntax – Validity for Codified Data
Working through the decision tree
results in a standard syntax for
expressing rules.
Rules can be used to:
• Clarify expectations about quality
• Measure data quality
• Report on data quality
Depending on how much we know
about the data, they can also be used to
transform data. For example, to
populate a consistent default value for
all invalid values.
| Dimension | Data Subset | Rule | Meaning |
| --- | --- | --- | --- |
| Validity | Codified data | Valid values are limited to: [List of valid values] | Specifies a list of values that are valid. All other values are invalid. |
| Validity | Codified data | Values must exist in [code table / column …] | Specifies the code table and the column in the code table in which valid values are stored. All other values are invalid. |
| Validity | Codified data | The range of valid values is between: [MIN] and [MAX] | Provides the MIN and MAX value for the range. Any values outside of the MIN/MAX are invalid. |
| Validity | Codified data | Invalid values include: [List of invalid values] | Specifies a list of values that are not valid. All other values are valid. |
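The rule syntax above lends itself to simple predicate builders. This hypothetical sketch turns each of the four validity rule patterns into a reusable check; the rule-type names and parameters are invented shorthand for illustration.

```python
def make_validity_check(rule_type, param):
    """Build a validity predicate from one of the four rule patterns above.
    The rule_type strings are invented shorthand, not a standard syntax."""
    if rule_type == "valid_list":      # Valid values are limited to: [...]
        allowed = set(param)
        return lambda v: v in allowed
    if rule_type == "code_table":      # Values must exist in [code table / column]
        lookup = set(param)            # param stands in for the code-table column
        return lambda v: v in lookup
    if rule_type == "range":           # Range of valid values between MIN and MAX
        lo, hi = param
        return lambda v: lo <= v <= hi
    if rule_type == "invalid_list":    # Invalid values include: [...]
        blocked = set(param)
        return lambda v: v not in blocked
    raise ValueError(f"unknown rule type: {rule_type}")

# Hypothetical usage
is_valid_status = make_validity_check("valid_list", ["A", "I", "P"])
is_valid_score = make_validity_check("range", (0, 100))
```

Because each rule resolves to true or false, the same predicates can be used to clarify expectations, to measure, and to report.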
44. Proprietary
Using DQ Dimensions to Create Good Measurements
Characteristics of Good Measurements
• Meaningful: They are focused on characteristics that are important.
» Think: taking a child’s temperature
• Comprehensible: They present information that people can understand.
» Think: understanding how to read a thermometer
• Actionable: They allow people to make a decision or take an action.
» Think: knowing what to do when the temperature is higher than normal
Dimensions help with all of these
• Meaningful: They are focused on characteristics that are important
– They define what = GOOD Quality and what = BAD Quality
• Comprehensible: They present information that people can understand.
– They allow people to understand what is RIGHT or WRONG with the data – it is incomplete, invalid, etc.
• Actionable: They allow people to make a decision or take an action.
– They allow people to decide whether or not the data they want to use is FIT FOR PURPOSE
45. Proprietary
Sample Data Quality Standards
| Data Element | Completeness | Conformity | Validity | Integrity |
| --- | --- | --- | --- | --- |
| Date of Birth | Must be populated | Must conform to the date format requirements of the system in which it is present | Cannot be a future date; person cannot be older than 120 years (based on current date) | All occurrences of records for an individual should have the same date of birth |
| Birth Gender | Must be populated | See Validity | Values must exist in [list OR code table / column …] | All occurrences of records for an individual should have the same birth gender |
| Gender Identity | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | All occurrences of records for an individual within a time frame should have the same gender identity |
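The Date of Birth standards in this table can be expressed as one small check routine. This is a sketch that assumes an ISO yyyy-mm-dd format; a real system would substitute its own format requirements.

```python
from datetime import date, datetime

def check_birth_date(value, today=None):
    """Apply the Date of Birth standards above; ISO yyyy-mm-dd format assumed."""
    today = today or date.today()
    if not value:                    # Completeness: must be populated
        return "missing"
    try:                             # Conformity: must match the date format
        dob = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return "nonconforming"
    if dob > today:                  # Validity: cannot be a future date
        return "future_date"
    if today.year - dob.year > 120:  # Validity: cannot be older than 120 years
        return "implausible_age"
    return "ok"
```

The order of the checks mirrors the logical relationship between the dimensions: a missing or nonconforming value makes the validity checks moot.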
46. Proprietary
Sample Data Quality Standards
| Data Element | Completeness | Conformity | Validity | Integrity |
| --- | --- | --- | --- | --- |
| Ethnicity | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | Optional field - no integrity rules |
| Race | Optional -- no known rules for population | See Validity | Values must exist in [list OR code table / column …] | Optional field - no integrity rules |
| Marital Status | Situational -- required for some business processes | See Validity | Values must exist in [list OR code table / column …] | Situational |
| Relationship to subscriber | Must be populated | See Validity | Values must exist in [list OR code table / column …] | No rules identified; value can change over time. |
47. Proprietary
Applying DQ Dimensions – Lessons Learned
• The dimensions provided a new perspective on the data.
• Seeing the data separate from and in relation to systems
– What is optional in a system may be mandatory to a downstream process
• Translating common sense expectations about the data into consistent, objective criteria for reasonability
– Every person has a birth date and we can define a reasonable range for birth dates within a database
• Seeing gaps in expectations
– Marital status -- should everyone have a marital status, even a child? Or should only subscribers have this?
• Seeing that some concepts are not well-defined and may never be well-defined
– Race, ethnicity
• Some concepts that we once considered well defined are evolving
– Gender identity vs. birth gender
Despite the flux, we were able to come to consensus on our expectations – the dimensions provided a vocabulary to do so.
They allowed us to clarify expectations in a consistent manner.
48. Proprietary
Example Measurement
This measures the level of
completeness [MUST BE
POPULATED] of a critical field
It is very simple, because the
concept itself is very simple.
In many cases, you don’t need
more than this.
[Line chart: daily trend of field-level completeness, 27/12/2018 to 07/02/2019, y-axis from 0.0 to 0.9]
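A completeness trend like the one charted above takes only a few lines to produce. The batch dates and field values here are invented for illustration.

```python
# Sketch of a daily field-level completeness measurement for trending.
# Batches map a load date to the values of one critical field; data is invented.
daily_batches = {
    "2019-01-02": ["A1", "A2", None, "A4"],
    "2019-01-03": ["B1", None, None, "B4"],
}

def completeness_pct(values):
    """Share of values that are populated (non-null, non-empty)."""
    populated = sum(1 for v in values if v not in (None, ""))
    return populated / len(values)

trend = {day: completeness_pct(vals) for day, vals in daily_batches.items()}
# trend["2019-01-02"] -> 0.75; trend["2019-01-03"] -> 0.5
```

Plotting this series over time is what makes unexpected variation visible, which is the point of monitoring.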
52. Proprietary
Limitations of the Product Metaphor
The Challenges
• Data is not a physical product.
• Data is not tangible, but it is durable
• Data is easy to copy but very hard to reproduce from
scratch
• The same data can be used by multiple people and
processes at the same time
• Data is volatile
• The value of data changes based on context and
timing
• Using data often results in new data
Adapted from DMBOK2, chapter 1, which is adapted from Redman, Data Driven.
The Risks
An organization does not know what data it has
Data can be lost, breached, or misused
Data is replicated and variation is created between data
sets
The quality of data deteriorates over time or across
functions
Knowledge of data deteriorates within the organization
NOT JUST FITNESS FOR PURPOSE
Representational effectiveness: How well and
consistently data represents the concepts it is
intended to represent
Data Knowledge: How easily and well data
consumers can “decode” data
54. Proprietary
We don’t treat data as a product
• Production: Data comes from many places; very little control over the inputs
• Inventory: Organizations do not know what data they have, what condition it is in, what relation it has to the
processes that created it, etc.
• Storage: The ways that we store data have an impact on its quality, but we do not always account for this when we
work with data.
• Usage: Don’t know how data will be used – bring this into question. We do not recognize the connection between data
production and data uses
What would happen if we
treated physical products
the way that we treat data?
55. Proprietary
What is the Product? Who is the customer?
The challenges with the product concept are
related to how data evolves within the data
lifecycle and to different levels of awareness
of data as a product along the data chain:
• Evolution: Data has many uses, and these
uses change over time.
– Example: Mail order companies once wanted your
address simply to ship you a product. Now they
want it to understand customer demographic
patterns.
• Evolution: Once people start using data, they
want to refine data.
– Example: Transition from ICD-9 to ICD-10
represents a refinement of diagnosis codes
• Data Chain: Data that meets its initial quality criteria may
not be of high quality for downstream uses.
– Example: Data may be good enough to enable a claim to
be adjudicated, but not good enough to do outreach to a
member
• Data Chain: Many upstream processes are not aware of
the downstream uses of data.
– A field that is not required for Provider Demographics
may be required to assess quality of care
58. Proprietary
The Semiotic Challenge: Reality and Data
Source: Measuring Data Quality for Ongoing
Improvement. By Laura Sebastian-Coleman
(Morgan Kauffmann, 2013)
59. Proprietary
The Semiotic Challenge: Data and Reality
Source: Measuring Data Quality for Ongoing
Improvement. By Laura Sebastian-Coleman
(Morgan Kauffmann, 2013)
60. Proprietary
Data Quality as a Technical Challenge
Different technical approaches to creating and
using data influence the data itself.
Example: SAS vs. Hadoop rounding
difference
(Desc and acct numbers modified for example)
Problem: When data is extracted
in Hadoop & SAS using the same
query, there is a difference in the
number of records extracted
(7,192 records for Jan-18 period).
Observation: All records have
YTD_Actual less than 50 cents,
absolute value
Hypothesis: Hadoop appears to
round differently than SAS, so
records whose 'YTD_ACTUAL'
values rounded to '0' were
excluded from the query.
| PERIOD_KEY | MAJ_ACCT_DESC | MIN_ACCT_DESC | YTD_ACTUAL |
| --- | --- | --- | --- |
| Jan-18 | Account 1 | Health Management, LLC | 0.01 |
| Jan-18 | Account 2 | MEDICARE | -0.01 |
| Jan-18 | PREPAID EXP | Commissions-HMO Based Product | -0.02 |
| Jan-18 | CURR & DEF'D TAXES | CUR INC TAXES - STATE | 0.29 |
| Jan-18 | MISC LIAB | Settlements | -0.1 |
| Jan-18 | EXP - GEN'L | Telecom Comm Equip:Owned | -0.39 |
| Jan-18 | EXP - GEN'L | Phone - Local/Long Distance | 0.13 |
| Jan-18 | EXP - GEN'L | Phone - Local/Long Distance | 0.33 |
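The rounding hypothesis can be illustrated in miniature. This sketch does not reproduce actual SAS or Hadoop behavior; it simply shows how two rounding modes (half-up vs. banker's half-even) can make a "rounded amount <> 0" style filter keep different record counts over the same cent-level data. The amounts are invented.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Invented amounts; 0.50 is the boundary case where half-up and
# half-even (banker's) rounding disagree.
amounts = [Decimal("0.50"), Decimal("-0.50"), Decimal("0.29")]

def rows_kept(values, rounding):
    """Rows surviving a filter like: WHERE round(amount, 0) <> 0."""
    return [v for v in values if v.quantize(Decimal("1"), rounding=rounding) != 0]

half_up = rows_kept(amounts, ROUND_HALF_UP)      # 0.50 and -0.50 round to +/-1, kept
half_even = rows_kept(amounts, ROUND_HALF_EVEN)  # all three round to 0, dropped
```

Identical source data, two engines, two record counts: exactly the kind of discrepancy that surfaces as a "data quality problem" but is actually a technical-environment difference.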
61. Proprietary
Data Quality: The Knowledge Challenge
The knowledge challenge: In any organization, data is more complicated than a
single person can comprehend. Because data is complicated, it cannot be
managed without metadata (documented knowledge about the data).
The challenge goes beyond knowledge of the data to knowledge of how to manage
data quality. It includes:
• Unexplored assumptions about data and data management – some of which we
covered in the semiotic challenge.
• Lack of consensus about the meaning of key concepts (Data, Data Quality, Data
Quality measurement) – which is why I started with definitions.
• Lack of clear goals and deliverables for the data assessment process.
• Lack of a methodology for defining "requirements", "expectations", and other
criteria for the quality of data at the level of detail needed for measurement.
62. Proprietary
Data Quality as a Political Challenge
The political challenge: Data is knowledge, knowledge is power, power is political
Etymology of Politics:
• Poli = many
• Ticks = blood sucking vermin
Most people dislike politics.
People do not always mean to be political about data.
But data represents business processes, so people are protective.
Their data may be high quality OR it may be low quality data OR they may not know.
Data is about knowledge and people like to be knowledgeable. No one likes to feel
"un-knowledgeable" (i.e., dumb).
64. Proprietary
Definition: Data Quality & Data Quality Management
Data Quality: A measure of the degree to which data is fit for the purposes of the people and systems that use the data.
Data Quality Management: A set of activities intended to ensure that data is fit for purpose, including:
• Data quality assessment
• Data quality requirements definition
• Data quality monitoring
• Data issue detection
• Issue remediation
• Reporting on data quality
• Improving business and technical processes to ensure data is of high quality
All data management processes have the potential to impact the fitness of data for use. But not every data management
process needs to be called a “data quality” process.
This is the activity of running
the engine and looking at the
results (data profiling and
analysis).
Analysis of profiling results
supports these activities
65. Proprietary
Definition: Data Profiling
• Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing.
• Data quality assessment is the process of evaluating data to identify errors and understand their
implications (Maydanchik, 2007).
• Data profiling is a specific kind of data analysis used to discover and characterize important
features of data fields and data sets, including:
– Data types
– Field lengths
– Cardinality of columns
– Granularity
– Existing values
– Format patterns
– Content patterns
– Implied rules
– Cross-column and cross-file
data relationships
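A minimal profiling pass over a single column might look like the following sketch. The sample values are invented; the pattern analysis maps digits to 9 and letters to A, a common profiling convention, so that format patterns like "99999" surface from the raw values.

```python
import re
from collections import Counter

def profile_column(values):
    """Minimal column profile: counts, cardinality, values, format patterns."""
    non_null = [v for v in values if v not in (None, "")]
    # Map digits to 9 and letters to A to characterize format patterns
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
        "format_patterns": patterns.most_common(3),
    }

# Invented sample: a ZIP-code-like field with a short and a malformed value
profile = profile_column(["02139", "02139", "2139", None, "0213A"])
```

Even this toy profile surfaces findings an analyst would document: a null, a dominant "99999" pattern, and two minority patterns worth questioning.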
67. Proprietary
Profiling Goals – Overcoming the Knowledge Challenge in Projects
• Reduce risks related to data development
• Enable initial assessment of source-supplied metadata to reduce the risk of errors related to incorrect identification
of data fields
• Identify risks and obstacles to use of sources (data issues, incorrect assumptions, differences in data granularity,
naming conventions, etc.)
• Accurately identify encryption requirements
• Identify critical data for ongoing data quality measurement, monitoring, and reporting
• Improve project process efficiency
• Improve the quality and consistency of system metadata, beginning with table and column definitions
• Provide input to mapping, including conformance
• Provide input to data modeling
• Provide input to ETL design, including system controls
• Provide input for Quality Assurance and User Acceptance Testing
• Enable Governance over time
• Data quality monitoring
• Improved metadata
68. Proprietary
DART – Data Analysis Results Template
The DART’s worksheets break down into four groups:
• Reference information: Five tabs describe the template and provide guidance on how to observe data characteristics
in profiling results
• Project information: Two tabs bookend the process. One for project goals, the other for summarized findings and
action items.
• Findings and analysis: Four tabs that make up the core of the template and allow analysts to consistently document
what they see. (Note: At this time, the DQ Analysis tab will not be used by projects)
• DQ specification: Captures details for DQ measurements. Input for this will come from the findings and analysis tabs
REFERENCE TABS: Template Purpose and Overview; Guidelines and Usage Notes; DQ Check List; Field Definitions; Downloads
PROJECT ADMIN TABS: Project Details; Sign Offs and Action Items
ANALYSIS AND FINDINGS TABS: Context and Metadata Results; Table Level Results; Column Level Results
SPECIFICATION TAB: DQ Measurement Specification
70. Proprietary
Example Overall Findings
| FINDING CATEGORY | COUNT | PERCENTAGE |
| --- | --- | --- |
| No Data -- 100% defaulted | 85 | 35% |
| Data appears as expected | 64 | 26% |
| Technical field | 21 | 9% |
| Data Differs from Metadata | 19 | 8% |
| Questionable Values | 17 | 7% |
| Should be encrypted, is not | 10 | 4% |
| Sparse Data -- 99% Defaulted | 9 | 4% |
| Questionable Population | 7 | 3% |
| Conformance Risk | 6 | 2% |
| Questionable Population and Values | 4 | 2% |
| TOTAL | 242 | 100% |
71. Proprietary
Making findings actionable
| FINDING CATEGORY | COUNT | PERCENTAGE | ACTION |
| --- | --- | --- | --- |
| Data appears as expected | 64 | 26% | No action |
| Technical field | 21 | 9% | No action |
| No Data -- 100% defaulted | 85 | 35% | Determine impact to project |
| Data Differs from Metadata | 19 | 8% | Clarify with source system; update metadata |
| Questionable Values | 17 | 7% | Clarify with source system; update metadata |
| Sparse Data -- 99% Defaulted | 9 | 4% | Clarify with source system; update metadata |
| Questionable Population | 7 | 3% | Clarify with source system; update metadata |
| Questionable Population and Values | 4 | 2% | Clarify with source system; update metadata |
| Conformance Risk | 6 | 2% | Inform BA's and Modelers |
| Should be encrypted, is not | 10 | 4% | Revise encryption requirements for file |
| TOTAL | 242 | 100% | |
[Bar chart: count of findings by category, grouped by action: no action required; determine if there is impact; clarify with source system; manage within the workstream (inform BA's and Modelers; update requirements)]
73. Proprietary
Some people are optimistic about Big Data
Often, big data is messy,
varies in quality .... What we
lose in accuracy at the micro
level we gain in insight at the
macro level.
Viktor Mayer-Schonberger and Kenneth Cukier, Big
Data: A revolution that will transform how we live,
work, and think.
A data lake's data quality practices are
less about the syntactic quality of the
data (are all the fields perfect?) and more
about the semantic quality of the data
(can we use this well?).
John Myers, "How to answer the top three objections to a data
lake." Info World. September 6, 2016
People who object to data lakes are only
defending the care, feeding, and maintenance of
a data warehouse. The types of 'needs' that this
objection is attempting to address are data
governance, quality, stewardship, and lineage.
John Myers, "How to answer the top three objections to a data lake." Info
World. September 6, 2016
74. Proprietary
And other people are not
We see customers creating big
data graveyards, dumping
everything into HDFS and
hoping to do something with it
down the road. But they just
lose track of what's there.
Sean Martin, Cambridge Semantics
Many companies are guilty of dumping
data into the data lake without a strategy
for keeping track of what's being
ingested. This leads to a murky, swampy
repository .... Unlike relational databases,
Hadoop is little help when it comes to
quality control.
Tony Fisher, Validating Data in the Data Lake. Zaloni Blog. December 15,
2016.
Without at least some
semblance of information
governance, the lake will end
up being a collection of
disconnected data pools or
information silos all in one
place.
Gartner, "Gartner says Beware the Data Lake Fallacy"
Some data lake initiatives have not
succeeded, producing instead more
silos or empty sandboxes.
Brian Stein and Alan Morrison, "The enterprise data lake: Better
integration and deeper analytics." PWC Technology Forecast:
Rethinking Integration. Issue 1, 2014.
75. Proprietary
Big Data – Definition
The DMBOK2 points out:
• The term Big Data is associated with technological changes that have enabled people “to
generate, store, and analyze larger and larger amounts of data.”
• People use this data "to predict and influence behavior, as well as gain insight on a range of
important subjects, such as health care practices, natural resource management, and economic
development." And shopping.
Big Data goes hand-in-hand with data science:
• Changes in technology not only enable collection of huge amounts of data, they also enable
analysis of it.
• Data Science includes the creation of models that enable understanding of possible outcomes if
variables change.
76. Proprietary
Data as the Product of a Process
DQ problems are usually detected in data output, but those problems can be caused at any point
in the production or consumption process.
Data Quality is understood in terms of outputs:
• Expected outputs = Good Quality Data
• Unexpected outputs = Poor Quality Data
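The "expected vs. unexpected outputs" idea can be made concrete as a simple process-control check. This is an illustrative sketch only; the function name, the record-count metric, and the tolerance range are hypothetical examples, not anything specified in the deck.

```python
# Hypothetical sketch: classifying a process output as expected or unexpected.
# The expected range encodes knowledge of the production process (e.g., "we
# normally receive about 1,000 records per day from this feed").

def check_output(record_count, expected_range):
    """Classify one day's output volume against process expectations."""
    low, high = expected_range
    if low <= record_count <= high:
        return "expected"    # expected output -> likely good-quality data
    return "unexpected"      # unexpected output -> investigate the process

expected = (900, 1100)  # illustrative daily volume expectation

print(check_output(1020, expected))  # expected
print(check_output(150, expected))   # unexpected: possible upstream failure
```

The point of the sketch is that the quality signal comes from comparing output to an expectation derived from the production process, not from inspecting the data in isolation.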
77. Proprietary
Big Data Production Process has Different Risks
Still a relationship between inputs, steps, and outputs, but more risk in the process.
Risk can be reduced through knowledge of the original production processes for the data.
78. Proprietary
Life Cycle Management
The same questions
apply to Big Data as
apply to traditional
data.
The same
connections exist
between data quality,
metadata and data
governance.
79. Proprietary
Big Data and the Product Metaphor
Big Data
• Volume
• Variety
• Velocity
• Veracity
Creating more data, of different kinds,
more quickly.
Different types of data have different
degrees of structure depending on
how they are produced.
• Production: Data comes from many places;
very little control over the inputs
• Inventory: Organizations do not know what
data they have, what condition it is in, what
relation it has to the processes that created
it, etc.
• Storage: The ways that we store data have
an impact on its quality
• Usage: We do not know in advance how data will be used, which calls our assumptions
into question. We often fail to recognize the connection between data production and
data uses.
• For Big Data, these problems are intensified.
80. Proprietary
Volume & Velocity: Impact on Veracity
The Big Data characteristics of volume and velocity affect how veracity (truth, and from
there, quality) can even be defined.

Type of Data      | Volume                | Velocity                         | Veracity
Mainframe         | Large but predictable | Fast but predictable             | Measurable
Tabular           | Large but predictable | Fast but predictable             | Measurable
Machine Generated | Potentially huge      | Super fast                       | Depends on calibration
Unstructured      | Potentially huge      | As fast as people can produce it | What would this even mean?
81. Proprietary
Variety
We associate Big Data with new kinds of data, but a lot of traditional data is also being stored in
data lakes.
Big Data is often referred to as "unstructured," but it includes a lot of semi-structured data, as
well as forms of data that are inherently structured by virtue of how they are collected.
Type of Data           | Example          | Inherent Structure
Trad: Mainframe        | EBCDIC files     | High, but messy
Trad: Tabular          | Warehouse tables | High
Big: Machine Generated | Sensor data      | Very high
Big: Unstructured      | Twitter          | Low
82. Proprietary
Logical Relationship between DQ Dimensions
• COMPLETENESS: You have all the pieces of data you need or expect to have.
If you do not have all the data you need, then other measurements of quality may not even matter.
• FORMAT CONFORMITY: Data is in the form you expect it to be in.
If data is not in the right format, then it cannot be valid or relate to other data in the ways you expect.
• VALIDITY: Data values belong to the set of possible (expected) values.
If the data values are not in the allowed set of values, then they cannot be correct.
• INTEGRITY: Different pieces of data relate to each other in the ways you expect them to.
If the different pieces of data do not fit together in the ways you expect, then you cannot use the data in the way
you intended to.
• CONSISTENCY: Data follows patterns that you expect it to follow.
If the data does not follow expected patterns, then you will want to understand why (a change in the pattern, or an
error in the data?)
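The five dimensions above can each be expressed as an executable check. The sketch below is illustrative: the field names, allowed values, and rules are hypothetical examples chosen to show the logical ordering of the dimensions, not rules from the deck.

```python
# Illustrative checks for the five DQ dimensions, in their logical order.
# All field names and rules here are hypothetical examples.
import re

EXPECTED_FIELDS = {"member_id", "visit_date", "state"}
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")  # format conformity rule
VALID_STATES = {"MA", "NH", "VT"}               # validity: allowed value set

def completeness(record):
    """You have all the pieces of data you expect to have."""
    return EXPECTED_FIELDS <= record.keys()

def format_conformity(record):
    """Data is in the form you expect it to be in."""
    return bool(DATE_FORMAT.fullmatch(record.get("visit_date", "")))

def validity(record):
    """Values belong to the set of possible (expected) values."""
    return record.get("state") in VALID_STATES

def integrity(record, known_members):
    """Pieces of data relate to each other as expected (member must exist)."""
    return record.get("member_id") in known_members

def consistency(records, expected_daily_count, tolerance=0.2):
    """Data follows an expected pattern (here, a stable daily volume)."""
    return abs(len(records) - expected_daily_count) <= tolerance * expected_daily_count

rec = {"member_id": "M001", "visit_date": "2023-04-01", "state": "MA"}
print(completeness(rec), format_conformity(rec), validity(rec),
      integrity(rec, {"M001", "M002"}))
```

Note how the checks build on one another: a record that fails completeness cannot meaningfully pass format conformity, and a value that fails format conformity cannot be valid.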
83. Proprietary
Big Data and Dimensions of Quality
Dimensions of quality provide a means to think about how to approach quality for big data.
Type of Data      | Completeness                                | Format Conformity                 | Validity                                   | Integrity                                 | Consistency
Mainframe         | Number of records generated per time period | Constrained by rules              | Constrained by rules                       | Can be systematically constrained         | Expectation based on the process the data represents
Tabular           | Number of records generated per time period | Constrained by rules              | Constrained by rules                       | Can be systematically constrained         | Expectation based on the process the data represents
Machine Generated | Rate at which data is collected             | Constrained by collection device  | Based on calibration of collection device  | Depends on consistent collection devices  | Expectation based on the process the data represents
Unstructured      | ??                                          | Not relevant                      | Not applicable                             | ??                                        | No expectation of consistency
84. Proprietary
Data Quality Challenges – Intensified by Big Data
• The semiotic challenge: People have different ways of representing the “same” concepts
– Traditional: GOVERNANCE
– Big: Governance, but at the category and metadata level
• The technical challenge: Different technical approaches to creating and using data influence the data itself.
– Traditional: DATA STANDARDS
– Big: Manage the ingest process, esp. manage metadata up front
• The knowledge challenge: Because data is complicated, a single individual cannot know all the data.
– Traditional: METADATA
– Big: Metadata is even more important
• The political challenge: Data is knowledge, knowledge is power, power is political
– Traditional: GOVERNANCE / CULTURE
– Big: Governance/Culture
86. Proprietary
Meeting the Challenges for Big Data
METADATA – Addressing the knowledge challenge
• Production and Lineage: Data comes from many places;
very little control over the inputs. Need to know where data
comes from
• Inventory: Inventorying how much data an organization has
/ what data it has
• Storage: Need to understand how ingest and storage
process impacts data
• Usage: We will never know all the potential uses of data.
Ensure consumers know what the data represents, how it
was produced, how it is stored
• Metadata management: Enabling data usage by managing
knowledge of data; set minimum requirements for metadata
related to big data. The priorities change: it is not possible to
define every field in the way you would with traditional data.
GOVERNANCE – Addressing data, process,
and cultural risks
• Accountability: Defining data ownership and
accountability
• Protection: Protecting against the misuse of
data
• Risk mitigation: Managing risks associated
with data
• Standards: Defining and enforcing standards
for data quality
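One way the metadata and governance points above come together in practice is enforcing minimum metadata requirements at ingest, so the lake stays an inventory rather than a graveyard. This is a hypothetical sketch: the required fields and names below are illustrative; each organization would set its own minimum bar.

```python
# Hypothetical sketch: enforcing minimum metadata at data lake ingest.
# The required fields are illustrative examples, not a prescribed standard.

REQUIRED_METADATA = {"source_system", "ingest_date", "owner", "description"}

def register_dataset(name, metadata, catalog):
    """Refuse to ingest a dataset whose metadata is below the minimum bar."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"Cannot ingest {name}: missing metadata {sorted(missing)}")
    catalog[name] = metadata  # inventory: we always know what we have

catalog = {}
register_dataset(
    "claims_2023",
    {"source_system": "mainframe_claims", "ingest_date": "2023-04-01",
     "owner": "claims_team", "description": "Daily claim extracts"},
    catalog,
)
print(sorted(catalog))
```

The design choice here mirrors the slide: rather than defining every field up front (as with traditional data), governance sets a minimum metadata standard and the ingest process enforces it, which addresses production, lineage, and inventory in one step.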
87. Proprietary
Summary
Product Management practices do work for traditional
data.
They also work for Big Data, but with modifications
based on the production processes of Big Data.
Managing the quality of both Big Data and traditional
(little) data is dependent on managing metadata.
The process of figuring out how to manage your data
will significantly inform what you need to do to govern
your data via standards and monitoring.
88. Proprietary
Meeting the Challenges with Big and Little Data:
Characteristics of a Trusted Source of Data
1. SECURE: Data is protected against inappropriate access or use through policies, processes, and tools.
2. RELIABLE: Data processing is predictable and reliable. The system is monitored for performance. Controls are in
place to detect and respond to unexpected events.
3. DATA QUALITY IS KNOWN: The criteria for high quality data are defined. Levels of quality are measured and
reported on. Data issues are communicated to data consumers and remediated based on business priorities.
4. TRANSPARENT AND COMPREHENSIBLE: Data consumers have the information (Metadata) they need to
understand and get value from the data. Knowledge about the system and its data is documented, accessible,
usable, and current.
5. SUPPORTED: A dedicated production support team is in place and has the processes and protocols it needs to
respond in a timely manner to questions and issues related to the operation of the system and the data in the
system.
6. COMMUNICATED: New data consumers have access to relevant training; existing data consumers are informed of
changes that impact their uses of the data.
7. GOVERNED: Processes and accountabilities are in place to make decisions about the data in the system.
89. Proprietary
Goals from Agenda
Agenda
• Introductions
• Quality management concepts and principles
• Applying quality management to traditional data
• Big Data challenges
• Data Quality Practices for Big Data and Little Data