Data Profiling:
The First Step to Big Data Quality
Harald Smith, Dir. Product Marketing
Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• Our team will reach out to you to answer them following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on
data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog author: “Data Democratized”
Only 35% of senior executives have a
high level of trust in the
accuracy of their Big Data
Analytics
KPMG 2016 Global CEO Outlook
92% of
executives are concerned
about the negative impact of
data and analytics on
corporate reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling
due to poor data quality
Dimensional Research, 2019
Big Data Needs
Data Quality
“Societal trust in business is
arguably at an all-time low
and, in a world increasingly
driven by data and
technology,
reputations and brands are
ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data
quality in the enterprise:
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
“The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Data Quality Challenges with Machine Learning
Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” –
Mistakes and errors are almost never the patterns you’re looking for in
a data set. Sparse data generates other issues. Correcting and
standardizing will tend to boost the signal, but must account for bias.
Missing context – Many data sources lack context around location or
population segments. Unless enriched with other data sets, (e.g.
geospatial, demographics, or firmographics data), some ML algorithms
will not be usable.
Multiple copies – If your data comes from many sources, as it often
does, it may contain multiple records of information about the same
person, company, product or other entity. Removing duplicates and
enhancing the overall depth and accuracy of knowledge about a single
entity can make a huge difference.
Spurious correlations – Just as missing context may hinder some ML
algorithms, inclusion of already correlated data (e.g. city and postal
code) may result in overfitting of ML algorithms.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
But data analysts may not be aware of
specific data quality issues that must be
addressed to support machine learning.
Traditional data quality processes are
an effective method to identify defects.
Understanding Big Data Quality
Data Profiling
The set of analytical techniques that
evaluate actual data content (vs.
metadata) to provide a complete view
of each data element in a data source.
Provides summarized inferences, and
details of value and pattern frequencies
to quickly gain data insights.
Business Rules
The data quality or validation rules that
help ensure that data is “fit for use” in
its intended operational and decision-
making contexts.
Covers the accuracy, completeness,
consistency, relevance, timeliness and
validity of data.
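To make the profiling side concrete, here is a minimal sketch of the kind of per-column summary a profiler produces: completeness, cardinality, inferred type, and value frequencies. This is an illustration only (Python/pandas, with a hypothetical customers.csv), not the webcast's product.

```python
import pandas as pd

# Hypothetical input file; any tabular source works the same way.
df = pd.read_csv("customers.csv")

profile = []
for col in df.columns:
    s = df[col]
    profile.append({
        "column": col,
        "inferred_type": str(s.dtype),
        "completeness_pct": round(100 * s.notna().mean(), 2),
        "distinct_values": s.nunique(dropna=True),
        # Top value frequencies hint at defaults, skew, and invalid codes.
        "top_values": s.value_counts(dropna=False).head(5).to_dict(),
    })

print(pd.DataFrame(profile).to_string(index=False))
```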
Five Key Steps to effective Data Profiling
These are not new, but good to reiterate in the
context of Big Data:
1. How do you want to analyze the data?
2. What should you review? (there's a lot of stuff)
3. What should you look for? (based on data “type”)
4. When should you build rules? (laser-focus; CDE’s)
5. What needs to be communicated?
1. How do you want to analyze the data?
Universal DQ best practices:
Understand the End Goal
• How does the business intend to
use the data (i.e. what’s the use
case)?
• Empower users (“Who”) to gain
new clarity into the core problem
(“Why”)
• What will the data be used for?
• What defines the Fitness for your
Purpose?
Establish Scope
• Ask the “right questions” about the
use case and the data (not just
“what” and “how”)
• What data is relevant to the effort?
• Big Data or other, you need to set
boundaries for the work
Understand Context
• How does the business define the
data?
• What are the important
characteristics and context of the
data?
• What are the Critical Data
Elements?
• What qualities will you need to
address, or leave alone?
• “High-quality data” definition will
vary by business problem

“If you don’t know what you want to
get out of the data, how can you
know what data you need – and
what insight you’re looking for?”
Wolf Ruzicka, Chairman of the Board at EastBanc Technologies,
Blog post: June 1, 2017, “Grow A Data Tree Out Of The ‘Big Data’
Swamp”
“Never lead with a data set;
lead with a question.”
Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017, “The Data Differentiator”
To Sample or not to Sample?
Sampling helps with:
• Data Integration
• Source-to-target mapping
• Data Modeling
• Discovering Correlations
When the focus is on the structure of the data
❖ REMEMBER: your target is a statistically
valid sample!
❖ ~16k records gives you 99% confidence
with a margin of error of 1% for 100B
records
❖ ~66k records gives you 99% confidence
with a margin of error of 0.5% for the same population
Full Volume needed with:
• Data Quality
• Data Governance
• Regulatory Compliance
• Finding Outliers and Issues
with Content
• “Needles in the haystack”
When the focus is on the quality of or risks
within the data
❖ Focus on critical data elements and
leverage tools that scale to data volume
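As a rough check on those sample-size figures, the standard proportion-based formula with a finite-population correction reproduces them. The sketch below assumes the usual worst-case proportion p = 0.5.

```python
import math

def sample_size(confidence_z: float, margin_of_error: float,
                population: float, p: float = 0.5) -> int:
    """Sample size for estimating a proportion, with finite-population correction."""
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# z = 2.576 for 99% confidence; population of 100 billion records.
print(sample_size(2.576, 0.01, 100e9))    # ~16.6k records at +/-1% margin of error
print(sample_size(2.576, 0.005, 100e9))   # ~66.4k records at +/-0.5%
```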
Big Data at scale distributes data across many
nodes – not necessarily with other relevant data!
• Processing routines must apply same approach and logic each
time
• Implications for profiling, joining, sorting, and matching data,
whether for enrichment, verification against trusted sources, or a
consolidated single view
Data Quality functions must be performed in a consistent manner,
no matter where actual processing takes place, how the data is
segmented, and what the data volume is.
• Data quality cleansing and preparation routines have to be
reproduced at scale, both to get the data ready to train machine
learning models, and to comply with business regulations.
• Critical to establishing, building, and maintaining trust
Scaling Data Quality best practices:
Consistent processing at scale
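One way to honor that requirement is to keep cleansing logic in a single, deterministic function and apply it unchanged to every partition or chunk. A minimal sketch (hypothetical file and field names) in pandas:

```python
import pandas as pd

def standardize(chunk: pd.DataFrame) -> pd.DataFrame:
    """Pure, deterministic cleansing applied identically to every partition."""
    out = chunk.copy()
    out["country"] = out["country"].str.strip().str.upper()
    out["active_flag"] = out["active_flag"].replace({"YES": "Y", "NO": "N"})
    return out

# Same logic whether the data is processed whole or in chunks across many nodes.
cleansed = pd.concat(standardize(c) for c in
                     pd.read_csv("orders.csv", chunksize=1_000_000))
```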
2. What do you want to review?
Common Data Quality Measurements
What measures can we take advantage of?
1. Completeness – Are the relevant fields populated?
2. Integrity – Does the data maintain an internal structural
integrity or a relational integrity across sources?
3. Uniqueness – Are keys or records unique?
4. Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
5. Consistency – Is the data at consistent levels of
aggregation or does it have consistent valid values
over time?
6. Timeliness – Did the data arrive in a time period
that makes it useful or usable?
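A hedged sketch of how a few of these measurements might be scored on a hypothetical orders table (the column names and thresholds are illustrative, not from the webcast):

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date", "load_date"])

metrics = {
    # Completeness: are the relevant fields populated?
    "customer_id_completeness": df["customer_id"].notna().mean(),
    # Uniqueness: are keys unique?
    "order_id_uniqueness": df["order_id"].nunique() / len(df),
    # Validity: does the data have the correct values?
    "status_validity": df["order_status"].isin({"OPEN", "SHIPPED", "CANCELLED"}).mean(),
    # Timeliness: did the data arrive soon enough to be usable?
    "loaded_within_1_day": (df["load_date"] - df["order_date"]).le(pd.Timedelta(days=1)).mean(),
}
print({k: round(v, 4) for k, v in metrics.items()})
```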
New data, new data quality challenges
• 3rd Party and external data with unknown provenance or relevance
• Bias in the data – whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g. gaps in receipt)
• Consistency and verification of data sources
• Changes and transformation applied to data (i.e. does it really
represent the original input)
New Data Quality Problems
“34 percent of bankers in our survey report that their organization
has been the target of adversarial AI at least once, and 78 percent
believe automated systems create new risks, such as fake data,
external data manipulation, and inherent bias.”
Accenture Banking Technology Vision 2018
• Contextual visualizations
• Value and pattern distributions
• Attribute summaries and metadata
• Sort and filter to quickly find data
of interest
• Detail drilldowns to any content
Let Data Profiling guide you
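Pattern (format) frequencies are one of the quickest profiling signals. A small sketch that masks letters to "A" and digits to "9" and counts the resulting formats; this is illustrative, not a specific product feature:

```python
import re
import pandas as pd

def mask(value) -> str:
    """Reduce a value to its character pattern, e.g. 'K1A 0B1' -> 'A9A 9A9'."""
    if pd.isna(value):
        return "<NULL>"
    s = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", s)

postal = pd.Series(["K1A 0B1", "90210", "90210-1234", None, "N/A"])
print(postal.map(mask).value_counts())
# Unexpected patterns (e.g. 'A/A') stand out immediately in the distribution.
```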
3. What should you look for?
Common Data Types
What variances do you need awareness of?
1. Identifiers – data that uniquely identifies something
2. Indicators – data that flags a specific condition
3. Dates – data that identifies a point in time
4. Quantities – data that identifies an amount or value of something
5. Codes – data that segments other data
6. Text – data that describes or names something
Identifiers
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Analytics
• AI/ML
Examples:
• Customer ID
• National ID / Passport #
• Social Security # / Tax ID
• Product ID
What to look for:
• 100% Complete
• All Unique values
• Anomalous patterns
• Numeric vs. String
Notes:
• Needs full volume assessment
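For identifiers, the completeness and uniqueness checks translate directly into code. A sketch against a hypothetical customer_id column:

```python
import pandas as pd

df = pd.read_csv("customers.csv", dtype={"customer_id": "string"})
ids = df["customer_id"]

print("complete:", ids.notna().all())          # expect 100% populated
print("unique:  ", ids.is_unique)              # expect no duplicate keys
# Duplicated identifiers are worth inspecting, not just counting.
print(df[ids.duplicated(keep=False)].head())
# Mixed numeric/string formats show up as more than one pattern.
print(ids.str.fullmatch(r"\d+").value_counts(dropna=False))
```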
Indicators (aka Flags)
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• True / False (or T/F)
• Yes / No (or Y/N)
• 1 / 0
What to look for:
• Binary Values only
• Consistent pattern
• No mixing of “Y” vs “YES”
• If NULL occurs, it must be
one of the binary values
• Skews in frequency
distributions
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify discrepancies
• Often are triggers for other conditions –
look for use in business rules, but likely
occur downstream
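For indicator fields, the check is essentially a domain check plus a skew check. A short sketch assuming a hypothetical active_flag column:

```python
import pandas as pd

flag = pd.read_csv("customers.csv")["active_flag"]

allowed = {"Y", "N"}
print("out-of-domain values:", set(flag.dropna().unique()) - allowed)  # e.g. {"YES", "1"}
print("nulls:", int(flag.isna().sum()))
# Heavy skew toward one value can be legitimate, but it is worth confirming.
print(flag.value_counts(normalize=True, dropna=False))
```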
Codes
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Account Status
• Credit Rating
• Diagnosis/Procedure Codes
• Order Status
• Postal Code
What to look for:
• Expected values
• Consistent patterns
• No mixing of “A” vs “active”
• NULL values
• Skews in frequency
distributions
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify discrepancies
• Often are triggers for or from other
conditions – look for use in business rules
• May correlate to other fields
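Code fields add two checks: membership in a reference list, and correlation with related fields (the city/postal-code example). A sketch with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("accounts.csv")

# Validity against a reference list of expected codes.
reference = set(pd.read_csv("account_status_codes.csv")["code"])
invalid = df.loc[~df["account_status"].isin(reference), "account_status"]
print("invalid codes:", invalid.value_counts().to_dict())

# Correlation with another field: each postal code should map to few cities.
cities_per_postal = df.groupby("postal_code")["city"].nunique()
print(cities_per_postal[cities_per_postal > 1].head())
```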
Dates
Use cases:
• Business Operations
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Birth Date
• Departure Date
• Order Date
• Shipping Date
• Timestamp
What to look for:
• Skews in frequency
distributions
• E.g. 01/01/2001
• Anomalous patterns
• Numeric vs. String
• Unusual values
• Missing values and gaps
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify
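For dates, the telltale signs are spikes on default values (such as 01/01/2001), implausible values, and gaps. A small sketch with a hypothetical order_date column:

```python
import pandas as pd

dates = pd.read_csv("orders.csv", parse_dates=["order_date"])["order_date"]

# Skew: a default date will dominate the frequency distribution.
print(dates.dt.date.value_counts().head(5))
# Unusual values: anything outside the plausible business range.
print(dates[(dates < "2000-01-01") | (dates > pd.Timestamp.now())].head())
# Gaps: days in the range with no records at all.
observed = set(dates.dt.date)
expected = pd.date_range(dates.min(), dates.max(), freq="D").date
print("missing days:", [d for d in expected if d not in observed][:10])
```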
Quantities
Use cases:
• Business Operations
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Amount (e.g. item count, amount due)
• Price
• Sales
• Total (e.g. order total)
What to look for:
• Skews in frequency
distributions
• Anomalous patterns
• Excessively high (or low)
values
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify
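For quantities, skew and extreme values are the main concern. A sketch using a simple interquartile-range rule on a hypothetical order_total column:

```python
import pandas as pd

totals = pd.read_csv("orders.csv")["order_total"]

q1, q3 = totals.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = totals[(totals < q1 - 1.5 * iqr) | (totals > q3 + 1.5 * iqr)]

print(totals.describe())                 # overall shape and skew
print("suspect values:", len(outliers))  # excessively high or low amounts
print(outliers.sort_values(ascending=False).head())
```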
Text
Use cases:
• Business Operations
• Building blocks for other
identifiers!
• 360 View of Entity
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Name
• Address
• Product Description
• Claim Description
What to look for:
• Missing Values
• Frequency of patterns /
Anomalous patterns
• Existence of numerics
• Values <= 5 characters
• Compound values
• Unusual, recurring values
• “Do not use”
Notes:
• Look for correlations with Code values
that indicate specific conditions (e.g.
values used for testing purposes)
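For text fields, profiling looks for suspiciously short values, embedded numbers, and recurring placeholders such as “Do not use.” A small sketch on a hypothetical product_description column:

```python
import pandas as pd

text = pd.read_csv("products.csv")["product_description"].astype("string")

print("missing:", int(text.isna().sum()))
print("very short (<= 5 chars):", int((text.str.len() <= 5).sum()))
print("contains digits:", int(text.str.contains(r"\d", na=False).sum()))
# Recurring placeholder values often mark test or junk records.
placeholders = {"DO NOT USE", "TEST", "N/A", "UNKNOWN"}
print("placeholders:", int(text.str.upper().isin(placeholders).sum()))
```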
4. When do you build rules?
Focus on:
• Critical Data Elements (data quality dimensions)
• Policy-based conditions (e.g. regulatory
compliance)
• Correlated data conditions (e.g. If x, then y)
• Filtering and segmenting data (refining
evaluations; investigating root cause)
Build Rules for Defined Conditions
• Validate critical requirements within or
across data sources
• Build common rules that can be readily
tested and shared
• Evaluate and remediate issues
• Take action on incorrect data and defaults
• Create flags for subsequent use in marking
or remediating data
• Filter result sets and export for additional
use
Benefits of Business Rules
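A hedged sketch of what such a rule might look like in code: a correlated-data condition (“if x, then y”) written as a reusable check that flags failing records for remediation and exports them for further use. The file and column names are illustrative assumptions.

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date", "ship_date"])

def rule_shipped_orders_have_ship_date(frame: pd.DataFrame) -> pd.Series:
    """If order_status is SHIPPED, then ship_date must be populated and not before order_date."""
    shipped = frame["order_status"].eq("SHIPPED")
    valid = frame["ship_date"].notna() & (frame["ship_date"] >= frame["order_date"])
    return ~shipped | valid   # rule passes automatically for non-shipped rows

# Flag failures for downstream remediation, then filter and export them.
df["dq_ship_date_ok"] = rule_shipped_orders_have_ship_date(df)
df.loc[~df["dq_ship_date_ok"]].to_csv("ship_date_exceptions.csv", index=False)
print("pass rate:", df["dq_ship_date_ok"].mean())
```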
5. What should you communicate?
Culture of Data Literacy
• “Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand and use data
• Trained to understand approaching and evaluating data quality
• Traditional data, new data, machine learning requirements, …
• Understand the business context of the data
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
Communicate!
• Annotate what you’ve found
• Identify the subject and add a description that is meaningful
• Utilize flags, tags, and other indicators to help others distinguish
types and severity of issues
• Integrate into data governance and BI tools for maximum visibility
Annotate Results with Findings
Summary
Evaluating Big Data
It is challenging to keep the end
goal in mind
• Data comes from multiple
disparate systems & sources
• The number of touchpoints for
policies and rules has grown
• There is a higher demand and
expectation for seeing data
quality in context.
• You need to assess and measure
the data content if you want to trust it
5 Key Steps
• Remember the end goal – ask
questions, use best practices,
and establish scope & context
• Consider what criteria and
dimensions are needed
• Focus your attention based on
the type of data and the use case
• Build rules when necessary to
get laser-focused
• Determine what needs to be
communicated and delivered
Gaining insight and measurement of data quality is more critical than ever!