Applying Data Quality Best
Practices at Big Data Scale
Harald Smith
Director of Product Management
Michael Urbonas
Director of Product Marketing
Speakers
Mike Urbonas
Director of Product Marketing,
Syncsort
15 years of software experience
including
– BI/DW & data visualization
– Data management & ETL
– Text analytics
– Enterprise search
Harald Smith
Director of Product Management,
Syncsort
20 years in Information
Management incl. data quality,
integration, and governance
– Consulting, product management,
software & solution development
Co-author of Patterns of Information
Management, as well as two Redbooks
on Information Governance and Data
Integration
Today’s agenda
Problem: Huge Big Data investments, Scarce Big Data trust
– New insights from Syncsort 2017 Big Data Trends survey
– Root causes of Data Lake distrust
Sample use cases at Big Data scale
– 360 degree view of the customer, product or other core entity
– Anti-fraud
Solution: Bringing Data Quality best practices into the Data Lake
– “Design once, deploy anywhere” with Syncsort/Trillium technology approach
– “Intelligent execution” to leverage the strength of the Big Data platform
Nobody wants a data swamp!
“This sure looked a lot nicer
on the whiteboard…”
Key problem:
Big Data deemed untrustworthy by business managers and leaders
Only 33% of senior
execs have a high level
of trust in the accuracy
of their Big Data
analytics ~ KPMG 2016
59% of global execs do
not believe their company
has capabilities to
generate meaningful
business insights from
their data ~ Bain 2015
85% of global execs say
major investments are
required to update their
existing data platform,
including data cleaning
and consolidating ~ Bain 2015
Fresh insights from Syncsort 2017 Big Data Trends survey
Data Quality is recognized as a
mission-critical data lake
success factor
– Data Quality tops the list of
challenges of data lake
implementation, followed
closely by Data Governance
Financial services and
insurance industry is most
focused on Data Quality and
Data Governance
– Named Data Quality as top
priority 50% more often than
participants from other
industries
– Also identified Data
Governance as a top priority at
more than twice the rate of
those from other industries
But… not everyone is making
the connection between Data
Quality and Big Data success
– Participants who did not
include data quality as a top 3
priority for implementing the
data lake expressed the most
interest in analytically-
intensive data lake uses…
which are highly dependent on
proper data quality
Root causes of Big Data mistrust
Are these numbers
accurate? Are
calculations using
correctly aggregated
data?
Is this data current?
When was it last
updated?
Are these terms
consistent with our
business definitions?
Can I trust this data
enough to make key
decisions and/or allow
the data to be used in
real-time?
Did we include all of
the data we should
have? Are additional
data sources missing?
Root causes of Big Data mistrust… examples
False
Assumptions
Pinterest targeted marketing
campaign mistakenly
congratulated single women
on upcoming weddings...
Miscoded/
Misinterpreted Data
Predictive analysis falsely
found call center workers
without a HS diploma were
3x more likely to remain on
board for at least 9 months…
…?
Duplicate
Data
Fraud examination revealed
massive import tariff evasion
on eggs, only to find there
was no case to crack…
Sample use cases at Big Data scale…
360 view of customer (or product, or other key entity)
Is Data Lake essential for this use case?
– YES… Purpose of customer 360 is to optimize customer
experience management
– Increasingly broad spectrum of data sources involved in and
required for effectively personalizing customer experiences
and targeted marketing offers
What Types of Data?
– Internal sources – often many/overlapping
– 3rd Party data – demographics
– Suppression data – keeping customer information updated
– New sources – mobile, social media
Internal Data
• Customer Master Data
• Point-of-Sale Data
• Contact Form Data
• Loyalty Program Data
• ecommerce Data
• Customer Service Data
Suppression Data
• Change of Address
• Mortality
• Do Not Call
Third-Party Data
• Age
• Occupation
• Education
• Gender
• Income
• Geographic
Sample use cases at Big Data scale…
Anti-Fraud/Anti-Money Laundering
Is Data Lake essential for this use case?
– YES… Fraudulent transaction detection requires huge volumes
of customer profile data, recent transaction activity with “last
known” values, device data with geolocation and time-based
tagging, 3rd party news/alerts
– Data used to refine Machine Learning models (e.g., anomaly
detection, implausible behavior analysis) to review new
transactions in real time
What Types of Data?
– Internal sources – often many/overlapping
– Suppression data – keeping customer information updated
– Mobile data – devices, locations
– New sources – social media, 3rd party data, …
Internal Data
• Customer Master Data
• Point-of-Sale Data
• Contact Form Data
• Loyalty Program Data
• ecommerce Data
• Customer Service Data
Mobile Data
• Device
• Location
• Wearables
• Mobile wallets
Suppression Data
• Change of Address
• Mortality
• Do Not Call
Social Data
• Sentiment
• Opinions
• Interests
• Social handles
The Fundamental Data Quality Question:
What are you trying to do?
“Never lead with a data set;
lead with a question.”
Anthony Scriffignano
Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017
“The Data Differentiator”
“If you don’t know what you want to get out of
the data, how can you know what data you need
– and what insight you’re looking for?”
Wolf Ruzicka
Chairman of the Board at EastBanc Technologies
Blog post: June 1, 2017
“Grow A Data Tree Out Of The “Big Data” Swamp”
Understanding Data Quality best practices:
Where to start?
Establishing Scope
Asking the “right questions” about your data (not just “what” and “how”)
– “Why” questions to understand core business problem
– “Who” questions to understand varying needs of all involved users (role, function, etc.)
Empowering users (“Who”) to gain new clarity into the core problem (“Why”)
– Bringing together data sources relevant to asking insightful questions of the data
– Enabling the data to answer the questions freely
– Building data analytics, algorithms, machine learning, etc. to expedite and broadcast
answers
Above lines of inquiry inform what Data Quality processing is required
– Determining how, what and where Data Quality is established based on business
problem
– “High-quality data” definition will vary by business problem
Understanding Data Quality best practices:
What’s the End Goal?
The End Goal drives Data Quality Requirements & Processes
Do you have all the data required?
– What’s the central entity? E.g. Customer, Product, Asset
– What’s the definition? E.g. “Customers” may mean customers, prospects, store visitors, …
– Are the sources comprehensive? E.g. any data silos? cover all geographies?
– Will “new” information be added? E.g. demographics, geolocation, …
How will data be matched, consolidated, or connected?
– One “golden” record? Or multiple links to connect all the dots?
What’s needed to facilitate the matching, consolidation, or connection required?
– E.g. Customer may need: Name, Address, Geolocation, Phone, Email
Have you evaluated the sources?
– Are the data sources “Fit for Purpose”?
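The "golden record or multiple links" question above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names (`source`, `source_id`, `updated`, `email`, `phone`), the sample values, and the "newest non-empty value wins" survivorship rule are hypothetical and do not describe any particular product's behavior.

```python
from datetime import date

# Two hypothetical source records already matched to the same customer.
matched = [
    {"source": "crm", "source_id": "17", "updated": date(2016, 1, 5),
     "email": "a@example.com", "phone": ""},
    {"source": "loyalty", "source_id": "L-9", "updated": date(2017, 4, 2),
     "email": "", "phone": "555-0100"},
]


def golden_record(records, fields):
    """Consolidate: for each field keep the newest non-empty value (assumed rule)."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    return {
        field: next((r[field] for r in newest_first if r.get(field)), None)
        for field in fields
    }


def link_set(records):
    """Connect: keep every source record and only remember how they link."""
    return [(r["source"], r["source_id"]) for r in records]


golden = golden_record(matched, ["email", "phone"])
links = link_set(matched)
```

The golden record gives one answer per field but discards the alternatives. The link set keeps every source value available and leaves the choice to each consumer.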
Applying Data Quality best practices:
Identifying required Data Quality dimensions
What data do we care about?
• What are the Critical Data Elements?
What measures can we take advantage of?
1) Completeness – Are the relevant fields populated?
2) Integrity – Does the data maintain an internal
structural integrity or a relational integrity across
sources
3) Uniqueness – Are keys or records unique?
4) Validity – Does the data have the correct values?
5) Consistency – Is the data at consistent levels of
aggregation or does it have consistent valid values
over time?
6) Timeliness – Did the data arrive in a time period that
makes it useful or usable?
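As a rough sketch of how several of these dimensions could be scored, the example below profiles a handful of call-center-style records. The field names, the reference list of agents, the seven-day timeliness window, and the sample values are all assumptions made for illustration. Consistency is left out because it needs history to compare against.

```python
from datetime import datetime, timedelta

# Hypothetical call-center records; field names are illustrative only.
calls = [
    {"call_id": "C1", "agent_id": "A1", "duration_sec": 125,
     "received": "2017-06-01T09:00:00"},
    {"call_id": "C2", "agent_id": "A2", "duration_sec": None,
     "received": "2017-06-01T09:05:00"},
    {"call_id": "C2", "agent_id": "A9", "duration_sec": -4,
     "received": "2017-05-20T10:00:00"},
]
known_agents = {"A1", "A2"}      # reference data for the integrity check
as_of = datetime(2017, 6, 2)     # fixed "now" so the scores are repeatable
max_age = timedelta(days=7)      # assumed timeliness window


def ratio(hits, total):
    return hits / total if total else 0.0


def profile(records):
    """Return the share of records passing each check, between 0.0 and 1.0."""
    n = len(records)
    durations = [r["duration_sec"] for r in records]
    ages = [as_of - datetime.fromisoformat(r["received"]) for r in records]
    return {
        # Completeness: is the field populated?
        "completeness": ratio(sum(d is not None for d in durations), n),
        # Integrity: does the agent exist in the reference source?
        "integrity": ratio(sum(r["agent_id"] in known_agents for r in records), n),
        # Uniqueness: distinct keys relative to the record count.
        "uniqueness": ratio(len({r["call_id"] for r in records}), n),
        # Validity: a duration must be present and non-negative.
        "validity": ratio(sum(d is not None and d >= 0 for d in durations), n),
        # Timeliness: did the record arrive within the window?
        "timeliness": ratio(sum(age <= max_age for age in ages), n),
    }


scores = profile(calls)
```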
Example: Call Center Record
Unique, Integrity, Complete, Timely
Consistent? Valid?
Is Duration = 0 important?
Is 01/01/20xx a defaulted date?
And how will this be linked or
connected with my other data?
The file appears complete, but
does it cover all call centers?
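The two questions about zero durations and January 1 dates can be turned into review flags rather than hard rejections, since only the business can say whether such values are errors. A minimal sketch, assuming hypothetical field names `duration_sec` and `call_date` and made-up sample rows:

```python
from datetime import date


def suspicious_flags(record):
    """Return the reasons a record deserves a second look (possibly none)."""
    flags = []
    if record["duration_sec"] == 0:
        flags.append("zero_duration")
    call_date = record["call_date"]
    if call_date.month == 1 and call_date.day == 1:
        # January 1 is a common system default; it may still be a real date.
        flags.append("possible_default_date")
    return flags


# Hypothetical sample rows for illustration.
records = [
    {"call_id": "C1", "duration_sec": 0, "call_date": date(2017, 3, 14)},
    {"call_id": "C2", "duration_sec": 95, "call_date": date(2017, 1, 1)},
    {"call_id": "C3", "duration_sec": 40, "call_date": date(2017, 3, 15)},
]

review_queue = {}
for record in records:
    flags = suspicious_flags(record)
    if flags:
        review_queue[record["call_id"]] = flags
```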
Example: Twitter Feed
Unique? Integrity? Complete? Consistent? Timely? Valid?
What else can we review or measure?
1) Coverage (Relevance) – How well does the data source meet the defined needs?
– E.g. does it cover the relevant geography? Is it biased?
2) Continuity – Are there data points for all intervals or expected intervals?
– E.g. sensors, weather records, call data records
3) Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference.
– E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70°
4) Provenance – Where did the data originate, who gathered it, and what criteria were used to create it?
– E.g. government agency, 3rd party provider, free or paid data
5) Transformation from origin – How many layers and/or changes has the data passed through?
– E.g. has the original data source already been merged with two other record sources? And is the result accurate?
6) Repetition or duplication of data patterns – Are data points exactly the same across multiple recording intervals or across multiple sensors?
– E.g. is there tampering with sensors or call data?
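Three of the measures above (continuity, triangulation, and repeated patterns) lend themselves to simple checks. The sketch below is one possible reading of each, not a standard definition: the function names, the neighbor map, and the tolerance are assumptions, and the temperatures reuse the Chicago, Louisville, and Indianapolis example.

```python
from datetime import datetime, timedelta


def missing_intervals(timestamps, step):
    """Continuity: list expected timestamps that are absent from the feed."""
    present = set(timestamps)
    if not present:
        return []
    missing, current, end = [], min(present), max(present)
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing


def triangulation_outliers(readings, neighbors, tolerance):
    """Triangulation: flag points far from the mean of their related points."""
    flagged = []
    for place, value in readings.items():
        nearby = [readings[n] for n in neighbors.get(place, []) if n in readings]
        if nearby and abs(value - sum(nearby) / len(nearby)) > tolerance:
            flagged.append(place)
    return flagged


def repeated_runs(values, min_run):
    """Repetition: return (start_index, length) for runs of identical values."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = i
    return runs


t0 = datetime(2017, 6, 1)
hour = timedelta(hours=1)
gaps = missing_intervals([t0, t0 + hour, t0 + 3 * hour], hour)

temps = {"Chicago": 30, "Louisville": 32, "Indianapolis": 70}
related = {"Indianapolis": ["Chicago", "Louisville"]}  # assumed neighbor map
outliers = triangulation_outliers(temps, related, tolerance=15)

stuck = repeated_runs([5, 5, 5, 7, 7, 9], min_run=3)
```

A flagged value is a prompt for review, not proof of an error: a stuck sensor and a genuinely stable reading look the same to `repeated_runs`.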
Applying Data Quality best practices:
‘New’ or ‘Extended’ Measures of Data Quality
Example: Twitter Feed
Triangulated
Continuity
Provenance
Coverage
Usage
Repeated patterns
Transformation
Jane Doe pulled from Twitter based on #Blackberry
All items for #Blackberry in time interval appear to be included
Marketing confirms these have high value
Good association with current sales data
All tweets appear unique within the date & vs. prior feeds
This needed to include #BB and #Crackberry as well!
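The coverage gap in the last callout can be made visible by comparing how many items a narrow tag list captures against a broader one. A small sketch follows; the tweet texts are invented for illustration, and the matching rule (case-insensitive hashtag comparison) is an assumption.

```python
def matches_topic(text, tags):
    """True if the text carries at least one of the given lowercase hashtags."""
    words = text.lower().split()
    hashtags = {w.rstrip(".,!?") for w in words if w.startswith("#")}
    return bool(hashtags & tags)


# Hypothetical tweets, not real data.
tweets = [
    "Loving my new phone #Blackberry",
    "Battery died again #BB",
    "Can't put it down #Crackberry!",
    "Lunch time #pizza",
]

narrow = {"#blackberry"}
broad = {"#blackberry", "#bb", "#crackberry"}

narrow_hits = sum(matches_topic(t, narrow) for t in tweets)
broad_hits = sum(matches_topic(t, broad) for t in tweets)
coverage = narrow_hits / broad_hits if broad_hits else 0.0
```

Here the narrow filter captures one of the three relevant tweets, so every other quality measure computed on that extract describes only a third of the conversation.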
Applying Data Quality best practices:
Understanding Context
Context is critical:
Even on data that is considered
“common” or “understood” such as
Name or Address or Product
Description
To parse or standardize data to
useful and usable components for
additional processing
To determine when and where to
verify or enrich the data content
To determine whether and how to
match records to a given entity
To identify whether to consolidate
data, and if so what other data
drives the consolidation
Applying Data Quality best practices:
Assessing Quality Requirements
Entity data (customer, product,
asset, …):
Requires understanding data
provenance and context
Requires integrating data from
multiple data sources
Requires determining whether
specific data should even be included
Presents differences in coverage,
completeness, consistency,
provenance, …
Comes from different points in time
May contain repetitions, particularly
from 3rd Party data sources
May contain data at different levels
of consolidation or aggregation
3rd Party
Name: Robert Smith Jr
Address: 3 Davy Drive
City: Rotherham
Postal Code: S66 7EN
Phone: +44(0)1189 823606
Email: bsmith850@gmail.com
Applying Data Quality best practices:
Utilizing Data Quality functions to achieve required DQ dimensions
Parse data values from unstructured
fields to their correct domains
Standardize values to enable higher
quality matching and linkage
Verify and enrich global postal
addresses and geolocations
Enrich data from external, third-party
sources to create comprehensive,
unified records
Match and link like records
Consolidate and aggregate to “golden”
record, if appropriate, based on factors
such as data source, date, …
Match records that belong to the same
domain (i.e., household or business)
Household View
Name: Smith
Address: 3 Davy Drive
City: Rotherham
Postal Code: S66 7EN
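The parse, standardize, and household-match steps can be sketched with the Smith record from the slides. Everything beyond that record is an assumption: the second and third records are invented, the suffix and street-type tables are tiny stand-ins for real reference data, and the household rule (same family name, address, and postal code) is only one plausible definition.

```python
import re
from collections import defaultdict

SUFFIXES = {"jr", "sr", "ii", "iii"}                          # illustrative only
STREET_TYPES = {"dr": "drive", "rd": "road", "ave": "avenue"}  # illustrative only


def parse_name(full_name):
    """Parse: split a free-text name into given, family, and suffix parts."""
    tokens = [t.strip(".,").lower() for t in full_name.split()]
    tokens = [t for t in tokens if t]
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else ""
    family = tokens[-1] if tokens else ""
    return {"given": " ".join(tokens[:-1]), "family": family, "suffix": suffix}


def standardize_address(line):
    """Standardize: drop punctuation, lowercase, expand street abbreviations."""
    words = re.sub(r"[^\w\s]", "", line).lower().split()
    return " ".join(STREET_TYPES.get(w, w) for w in words)


def standardize_postcode(code):
    return re.sub(r"\s+", "", code).upper()


def household_key(record):
    """Match: records sharing this key are treated as one household."""
    return (
        parse_name(record["name"])["family"],
        standardize_address(record["address"]),
        standardize_postcode(record["postal_code"]),
    )


records = [
    {"name": "Robert Smith Jr", "address": "3 Davy Drive", "postal_code": "S66 7EN"},
    {"name": "Mary Smith", "address": "3 Davy Dr.", "postal_code": "s66 7en"},   # hypothetical
    {"name": "R. Jones", "address": "8 Hill Road", "postal_code": "S66 7EN"},    # hypothetical
]

households = defaultdict(list)
for record in records:
    households[household_key(record)].append(record["name"])
```

Without the standardization step, "3 Davy Dr." and "3 Davy Drive" would land in different households, which is why matching quality depends on the steps before it.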
Applying Data Quality best practices:
Example
Large telco organization:
“What are our customers
saying about us in the
marketplace?
Where are the most common
complaints coming from?”
Issue: sparse results
concentrated in one region
Required: standardization,
enrichment, geocoding,
matching/record linkage,
address verification
Before Data Quality After Data Quality
Applying Data Quality best practices:
Example
Large telco organization:
“What are our most
profitable regions on a daily
basis?
Which are the most
profitable regions?”
Issue: poor geolocation
identifying wrong regions
Required: parsing,
standardization, address
verification, enrichment,
geocoding, matching/record
linkage
Before Data Quality After Data Quality
Applying Data Quality best practices:
Consistent processing
Big Data at scale distributes data across many nodes –
not necessarily with other relevant data!
– Implications for joining, sorting, and matching data,
whether for enrichment, verification against trusted
sources, or a consolidated single view
Data Quality functions must be performed in a
consistent manner, no matter where actual processing
takes place, how the data is segmented, and what the
data volume is
– Processing routines must apply same approach and logic
each time
– Critical to establishing, building, and maintaining trust
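One way to read the consistency requirement: the same records should produce the same match groups no matter how many partitions the data is spread across. The sketch below simulates that with plain Python, not with any real cluster framework. The record fields, the match key, and the use of a SHA-256 digest to assign partitions are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def match_key(record):
    """Standardized key; applying the same logic everywhere is the point."""
    return (
        record["family"].strip().lower(),
        record["postal_code"].replace(" ", "").upper(),
    )


def partition_for(key, num_partitions):
    """Stable assignment: the same key always maps to the same partition."""
    digest = hashlib.sha256("|".join(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def match_groups(records, num_partitions):
    """Group record ids by match key, one partition at a time."""
    partitions = defaultdict(list)
    for record in records:
        slot = partition_for(match_key(record), num_partitions)
        partitions[slot].append(record)

    groups = set()
    for part in partitions.values():       # each partition is handled on its own
        by_key = defaultdict(list)
        for record in part:
            by_key[match_key(record)].append(record["id"])
        groups.update(frozenset(ids) for ids in by_key.values())
    return groups


# Hypothetical records; ids 1 and 2 differ only in formatting.
records = [
    {"id": 1, "family": "Smith", "postal_code": "S66 7EN"},
    {"id": 2, "family": " smith", "postal_code": "s667en"},
    {"id": 3, "family": "Jones", "postal_code": "S66 7EN"},
]

single_node = match_groups(records, 1)
many_nodes = match_groups(records, 16)
```

Partitioning on the standardized key, rather than on the raw value, is what keeps records 1 and 2 together. Partitioning on raw text could send them to different nodes, where they would never be compared.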
Source: HP Analyst Briefing
Trillium Quality for Big Data
Focus on Data Quality, not the Big Data platform
• Use existing Data Quality skills and expertise
• No need to worry about mappers, reducers, big side or small side of joins, etc.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x
• Run multiple execution frameworks in a single job
Single GUI Execute Anywhere!
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop
Bring Data Quality best practices into the Data Lake:
The Syncsort/Trillium technology approach
“Design once, deploy anywhere”
– Visually design data quality jobs once and run anywhere
(MapReduce, Spark, Linux, Unix, Windows; on premise or
in the cloud)
– Use-case templates to fast-track development
– Test & debug locally in Windows/Linux; deploy to Big Data
– Intelligent Execution dynamically optimizes data
processing at run-time based on the chosen compute
framework; no changes or tuning required
Benefit: Significantly reduce manual data preparation
– Major time sink for data scientists, architects and analysts
– Risk of inconsistent or incomplete data preparation
Benefit: Significantly increase trust in data
– Major time sink for executives
– Risk of poor data-based business decisions
“Data is useful. High-quality, well-understood,
auditable data is priceless.”
Ted Friedman
VP Distinguished Analyst
Article in CRM.com: Mar 8, 2005
“The Coming of BI Competency Centers”
Data Quality remains Data Quality, even at scale
“Data is the new science. Big Data holds the
answers. Are you asking the right questions?”
Pat Gelsinger
President and COO at EMC
Forbes Insights, June 22, 2012
“Big Bets On Big Data”
Questions and Next Steps
For more information on Trillium Quality for Big Data, visit: trilliumsoftware.com/products/big-data
Contact Info:
Mike Urbonas, Director of Product Marketing, Syncsort/Trillium Software
murbonas@syncsort.com
https://www.linkedin.com/in/mikeu
Harald Smith, Director of Product Management, Syncsort/Trillium Software
harald_smith@trilliumsoftware.com
https://www.linkedin.com/in/harald-smith-71028b
Thank You!

More Related Content

What's hot

The data quality challenge
The data quality challengeThe data quality challenge
The data quality challenge
Lenia Miltiadous
 
Data Management Meets Human Management - Why Words Matter
Data Management Meets Human Management - Why Words MatterData Management Meets Human Management - Why Words Matter
Data Management Meets Human Management - Why Words Matter
DATAVERSITY
 
Speed Matters - Intelligent Strategies to Accelerate Data-Driven Decisions
Speed Matters - Intelligent Strategies to Accelerate Data-Driven DecisionsSpeed Matters - Intelligent Strategies to Accelerate Data-Driven Decisions
Speed Matters - Intelligent Strategies to Accelerate Data-Driven Decisions
DATAVERSITY
 
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
DATAVERSITY
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
Intellspot
 
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipelineQlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Srikanth Sharma Boddupalli
 
Slides: Data Governance Reality Check
Slides: Data Governance Reality CheckSlides: Data Governance Reality Check
Slides: Data Governance Reality Check
DATAVERSITY
 
Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...
Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...
Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...
DATAVERSITY
 
Slides: Achieving a “Single Source of Truth” with BI in Your Enterprise
Slides: Achieving a “Single Source of Truth” with BI in Your EnterpriseSlides: Achieving a “Single Source of Truth” with BI in Your Enterprise
Slides: Achieving a “Single Source of Truth” with BI in Your Enterprise
DATAVERSITY
 
Building an Effective Data & Analytics Operating Model A Data Modernization G...
Building an Effective Data & Analytics Operating Model A Data Modernization G...Building an Effective Data & Analytics Operating Model A Data Modernization G...
Building an Effective Data & Analytics Operating Model A Data Modernization G...
Mark Hewitt
 
Impact of BIG Data on MDM
Impact of BIG Data on MDMImpact of BIG Data on MDM
Impact of BIG Data on MDM
Subhendu Dey
 
2. Getvisibility. Innovative data governance, control & oversight
2. Getvisibility. Innovative data governance, control & oversight2. Getvisibility. Innovative data governance, control & oversight
2. Getvisibility. Innovative data governance, control & oversight
Vanessa Pulgarín Auquilla
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Noise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataNoise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in Data
DATAVERSITY
 
Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...
Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...
Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...
DATAVERSITY
 
Predictive vs Prescriptive Analytics
Predictive vs Prescriptive AnalyticsPredictive vs Prescriptive Analytics
Predictive vs Prescriptive Analytics
DATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingEmerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big Thing
DATAVERSITY
 
Slides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data GovernanceSlides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data Governance
DATAVERSITY
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
DATAVERSITY
 
DAS Slides: Data Architect vs. Data Engineer vs. Data Modeler
DAS Slides: Data Architect vs. Data Engineer vs. Data ModelerDAS Slides: Data Architect vs. Data Engineer vs. Data Modeler
DAS Slides: Data Architect vs. Data Engineer vs. Data Modeler
DATAVERSITY
 

What's hot (20)

The data quality challenge
The data quality challengeThe data quality challenge
The data quality challenge
 
Data Management Meets Human Management - Why Words Matter
Data Management Meets Human Management - Why Words MatterData Management Meets Human Management - Why Words Matter
Data Management Meets Human Management - Why Words Matter
 
Speed Matters - Intelligent Strategies to Accelerate Data-Driven Decisions
Speed Matters - Intelligent Strategies to Accelerate Data-Driven DecisionsSpeed Matters - Intelligent Strategies to Accelerate Data-Driven Decisions
Speed Matters - Intelligent Strategies to Accelerate Data-Driven Decisions
 
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
 
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipelineQlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
Qlik wp 2021_q3_data_governance_in_the_modern_data_analytics_pipeline
 
Slides: Data Governance Reality Check
Slides: Data Governance Reality CheckSlides: Data Governance Reality Check
Slides: Data Governance Reality Check
 
Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...
Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...
Slides: Applying Artificial Intelligence (AI) in All the Right Places in the ...
 
Slides: Achieving a “Single Source of Truth” with BI in Your Enterprise
Slides: Achieving a “Single Source of Truth” with BI in Your EnterpriseSlides: Achieving a “Single Source of Truth” with BI in Your Enterprise
Slides: Achieving a “Single Source of Truth” with BI in Your Enterprise
 
Building an Effective Data & Analytics Operating Model A Data Modernization G...
Building an Effective Data & Analytics Operating Model A Data Modernization G...Building an Effective Data & Analytics Operating Model A Data Modernization G...
Building an Effective Data & Analytics Operating Model A Data Modernization G...
 
Impact of BIG Data on MDM
Impact of BIG Data on MDMImpact of BIG Data on MDM
Impact of BIG Data on MDM
 
2. Getvisibility. Innovative data governance, control & oversight
2. Getvisibility. Innovative data governance, control & oversight2. Getvisibility. Innovative data governance, control & oversight
2. Getvisibility. Innovative data governance, control & oversight
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
Noise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataNoise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in Data
 
Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...
Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...
Slides: Accelerate and Assure the Adoption of Cloud Data Platforms Using Inte...
 
Predictive vs Prescriptive Analytics
Predictive vs Prescriptive AnalyticsPredictive vs Prescriptive Analytics
Predictive vs Prescriptive Analytics
 
Emerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingEmerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big Thing
 
Slides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data GovernanceSlides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data Governance
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
DAS Slides: Data Architect vs. Data Engineer vs. Data Modeler
DAS Slides: Data Architect vs. Data Engineer vs. Data ModelerDAS Slides: Data Architect vs. Data Engineer vs. Data Modeler
DAS Slides: Data Architect vs. Data Engineer vs. Data Modeler
 

Similar to Applying Data Quality Best Practices at Big Data Scale

Emerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataEmerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big Data
DATAVERSITY
 
Emerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big DataEmerging Data Quality Trends for Governing and Analyzing Big Data
Emerging Data Quality Trends for Governing and Analyzing Big Data
Precisely
 
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackYour AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Precisely
 
Enabling Success With Big Data - Driven Talent Acquisition
Enabling Success With Big Data - Driven Talent AcquisitionEnabling Success With Big Data - Driven Talent Acquisition
Enabling Success With Big Data - Driven Talent Acquisition
David Bernstein
 
Information Governance: Reducing Costs and Increasing Customer Satisfaction
Information Governance: Reducing Costs and Increasing Customer SatisfactionInformation Governance: Reducing Costs and Increasing Customer Satisfaction
Information Governance: Reducing Costs and Increasing Customer SatisfactionCapgemini
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
Vipul Kalamkar
 
Creating Big Data Success with the Collaboration of Business and IT
Creating Big Data Success with the Collaboration of Business and ITCreating Big Data Success with the Collaboration of Business and IT
Creating Big Data Success with the Collaboration of Business and IT
Edward Chenard
 
Barry Ooi; Big Data lookb4YouLeap
Barry Ooi; Big Data lookb4YouLeapBarry Ooi; Big Data lookb4YouLeap
Barry Ooi; Big Data lookb4YouLeapBarry Ooi
 
CRM is not enough
CRM is not enoughCRM is not enough
CRM is not enough
Segment
 
Big data for small businesses
Big data for small businessesBig data for small businesses
Big data for small businesses
Tabor Consulting
 
Data foundation for analytics excellence
Data foundation for analytics excellenceData foundation for analytics excellence
Data foundation for analytics excellence
Mudit Mangal
 
Data mining
Data miningData mining
Data mining
Gagan Mittal
 
uae views on big data
  uae views on  big data  uae views on  big data
uae views on big data
Aravindharamanan S
 
How to get started in extracting business value from big data 1 of 2 oct 2013
How to get started in extracting business value from big data 1 of 2 oct 2013How to get started in extracting business value from big data 1 of 2 oct 2013
How to get started in extracting business value from big data 1 of 2 oct 2013Jaime Nistal
 
Big Data for the Retail Business I Swan Insights I Solvay Business School
Big Data for the Retail Business I Swan Insights I Solvay Business SchoolBig Data for the Retail Business I Swan Insights I Solvay Business School
Big Data for the Retail Business I Swan Insights I Solvay Business School
Laurent Kinet
 
The Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallThe Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallTrillium Software
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & Approaches
DATAVERSITY
 
The Data Driven Enterprise - Roadmap to Big Data & Analytics Success
The Data Driven Enterprise - Roadmap to Big Data & Analytics SuccessThe Data Driven Enterprise - Roadmap to Big Data & Analytics Success
The Data Driven Enterprise - Roadmap to Big Data & Analytics Success
BigInsights
 
Driving Business Performance with effective Enterprise Information Management
Driving Business Performance with effective Enterprise Information ManagementDriving Business Performance with effective Enterprise Information Management
Driving Business Performance with effective Enterprise Information Management
Ray Bachert
 
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics
Capgemini
 


Applying Data Quality Best Practices at Big Data Scale

  • 1. Applying Data Quality Best Practices at Big Data Scale Harald Smith Director of Product Management Michael Urbonas Director of Product Marketing
  • 2. Speakers Mike Urbonas Director of Product Marketing, Syncsort 15 years of software experience including – BI/DW & data visualization – Data management & ETL – Text analytics – Enterprise search Harald Smith Director of Product Management, Syncsort 20 years in Information Management incl. data quality, integration, and governance – Consulting, product management, software & solution development Co-author of Patterns of Information Management, as well as two Redbooks on Information Governance and Data Integration
  • 3. Today’s agenda Problem: Huge Big Data investments, Scarce Big Data trust – New insights from Syncsort 2017 Big Data Trends survey – Root causes of Data Lake distrust Sample use cases at Big Data scale – 360 degree view of the customer, product or other core entity – Anti-fraud Solution: Bringing Data Quality best practices into the Data Lake – “Design once, deploy anywhere” with Syncsort/Trillium technology approach – “Intelligent execution” to leverage the strength of the Big Data platform
  • 4. Nobody wants a data swamp! “This sure looked a lot nicer on the whiteboard…”
  • 5. Key problem: Big Data deemed untrustworthy by business managers and leaders Only 33% of senior execs have a high level of trust in the accuracy of their Big Data analytics ~ KPMG 2016
  • 6. Key problem: Big Data deemed untrustworthy by business managers and leaders Only 33% of senior execs have a high level of trust in the accuracy of their Big Data analytics ~ KPMG 2016 59% of global execs do not believe their company has capabilities to generate meaningful business insights from their data ~ Bain 2015
  • 7. Key problem: Big Data deemed untrustworthy by business managers and leaders Only 33% of senior execs have a high level of trust in the accuracy of their Big Data analytics ~ KPMG 2016 85% of global execs say major investments are required to update their existing data platform, including data cleaning and consolidating ~ Bain 2015 59% of global execs do not believe their company has capabilities to generate meaningful business insights from their data ~ Bain 2015
  • 8. Fresh insights from Syncsort 2017 Big Data Trends survey Data Quality is recognized as a mission-critical data lake success factor – Data Quality tops the list of challenges of data lake implementation, followed closely by Data Governance
  • 9. Fresh insights from Syncsort 2017 Big Data Trends survey Data Quality is recognized as a mission-critical data lake success factor – Data Quality tops the list of challenges of data lake implementation, followed closely by Data Governance Financial services and insurance industry is most focused on Data Quality and Data Governance – Named Data Quality as top priority 50% more often than participants from other industries – Also identified Data Governance as a top priority at more than twice the rate of those from other industries
  • 10. Fresh insights from Syncsort 2017 Big Data Trends survey Data Quality is recognized as a mission-critical data lake success factor – Data Quality tops the list of challenges of data lake implementation, followed closely by Data Governance But… not everyone is making the connection between Data Quality and Big Data success – Participants who did not include data quality as a top 3 priority for implementing the data lake expressed the most interest in analytically-intensive data lake uses… which are highly dependent on proper data quality Financial services and insurance industry is most focused on Data Quality and Data Governance – Named Data Quality as top priority 50% more often than participants from other industries – Also identified Data Governance as a top priority at more than twice the rate of those from other industries
  • 11. Root causes of Big Data mistrust
  • 12. Root causes of Big Data mistrust Are these numbers accurate? Are calculations using correctly aggregated data? Is this data current? When was it last updated? Are these terms consistent with our business definitions? Can I trust this data enough to make key decisions and/or allow the data to be used in real-time? Did we include all of the data we should have? Are additional data sources missing?
  • 13. Root causes of Big Data mistrust… examples False Assumptions Pinterest targeted marketing campaign mistakenly congratulated single women on upcoming weddings...
  • 14. Root causes of Big Data mistrust… examples False Assumptions Pinterest targeted marketing campaign mistakenly congratulated single women on upcoming weddings... Miscoded/ Misinterpreted Data Predictive analysis falsely found call center workers without a HS diploma were 3x more likely to remain on board for at least 9 months… …?
  • 15. Root causes of Big Data mistrust… examples False Assumptions Pinterest targeted marketing campaign mistakenly congratulated single women on upcoming weddings... Duplicate Data Fraud examination revealed massive import tariff evasion on eggs, only to find there was no case to crack… Miscoded/ Misinterpreted Data Predictive analysis falsely found call center workers without a HS diploma were 3x more likely to remain on board for at least 9 months… …?
  • 16. Sample use cases at Big Data scale… 360 view of customer (or product, or other key entity) Is Data Lake essential for this use case? – YES… Purpose of customer 360 is to optimize customer experience management – Increasingly broad spectrum of data sources involved in and required for effectively personalizing customer experiences and targeted marketing offers What Types of Data? – Internal sources – often many/overlapping – 3rd Party data – demographics – Suppression data – keeping customer information updated – New sources – mobile, social media Internal Data: Customer Master Data, Point-of-Sale Data, Contact Form Data, Loyalty Program Data, ecommerce Data, Customer Service Data. Suppression Data: Change of Address, Mortality, Do Not Call. Third-Party Data: Age, Occupation, Education, Gender, Income, Geographic.
  • 17. Sample use cases at Big Data scale… Anti-Fraud/Anti-Money Laundering Is Data Lake essential for this use case? – YES… Fraudulent transaction detection requires huge volumes of customer profile data, recent transaction activity with “last known” values, device data with geolocation and time-based tagging, 3rd party news/alerts – Data used to refine Machine Learning models (e.g., anomaly detection, implausible behavior analysis) to review new transactions in real time What Types of Data? – Internal sources – often many/overlapping – Suppression data – keeping customer information updated – Mobile data – devices, locations – New sources – social media, 3rd party data, … Internal Data: Customer Master Data, Point-of-Sale Data, Contact Form Data, Loyalty Program Data, ecommerce Data, Customer Service Data. Mobile Data: Device, Location, Wearables, Mobile wallets. Suppression Data: Change of Address, Mortality, Do Not Call. Social Data: Sentiment, Opinions, Interests, Social handles.
  • 18. The Fundamental Data Quality Question: What are you trying to do? “Never lead with a data set; lead with a question.” Anthony Scriffignano Chief Data Scientist, Dun & Bradstreet Forbes Insights, May 31, 2017 “The Data Differentiator” “If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?” Wolf Ruzicka Chairman of the Board at EastBanc Technologies Blog post: June 1, 2017 “Grow A Data Tree Out Of The “Big Data” Swamp”
  • 19. Understanding Data Quality best practices: Where to start? Establishing Scope Asking the “right questions” about your data (not just “what” and “how”) – “Why” questions to understand core business problem – “Who” questions to understand varying needs of all involved users (role, function, etc.) Empowering users (“Who”) to gain new clarity into the core problem (“Why”) – Bringing together data sources relevant to asking insightful questions of the data – Enabling the data to answer the questions freely – Building data analytics, algorithms, machine learning, etc. to expedite and broadcast answers Above lines of inquiry inform what Data Quality processing is required – Determining how, what and where Data Quality is established based on business problem – “High-quality data” definition will vary by business problem
  • 20. Understanding Data Quality best practices: What’s the End Goal? The End Goal drives Data Quality Requirements & Processes Do you have all the data required? – What’s the central entity? E.g. Customer, Product, Asset – What’s the definition? E.g. “Customers” may mean customers, prospects, store visitors, … – Are the sources comprehensive? E.g. any data silos? cover all geographies? – Will “new” information be added? E.g. demographics, geolocation, … How will data be matched, consolidated, or connected? – One “golden” record? Or multiple links to connect all the dots? What’s needed to facilitate the matching, consolidation, or connection required? – E.g. Customer may need: Name, Address, Geolocation, Phone, Email Have you evaluated the sources? – Are the data sources “Fit for Purpose”?
  • 21. Applying Data Quality best practices: Identifying required Data Quality dimensions What data do we care about? • What are the Critical Data Elements? What measures can we take advantage of? 1) Completeness – Are the relevant fields populated? 2) Integrity – Does the data maintain an internal structural integrity or a relational integrity across sources? 3) Uniqueness – Are keys or records unique? 4) Validity – Does the data have the correct values? 5) Consistency – Is the data at consistent levels of aggregation or does it have consistent valid values over time? 6) Timeliness – Did the data arrive in a time period that makes it useful or usable?
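A few of the dimensions above can be expressed as simple ratios over a record set. The following is a minimal sketch, not the method used by any particular product: the sample records, the field names (`id`, `email`, `country`, `received`), the reference list of valid countries, and the cutoff date are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "US", "received": date(2017, 6, 1)},
    {"id": 2, "email": "", "country": "GB", "received": date(2017, 6, 2)},
    {"id": 2, "email": "c@example.com", "country": "XX", "received": date(2017, 5, 1)},
    {"id": 4, "email": "d@example.com", "country": "US", "received": date(2017, 6, 3)},
]

VALID_COUNTRIES = {"US", "GB"}  # assumed reference list of allowed values


def completeness(rows, field):
    """Share of rows where the field is populated (not None or empty)."""
    return sum(1 for r in rows if r.get(field) not in (None, "")) / len(rows)


def uniqueness(rows, key):
    """Share of rows whose key value occurs exactly once."""
    counts = {}
    for r in rows:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return sum(1 for r in rows if counts[r[key]] == 1) / len(rows)


def validity(rows, field, allowed):
    """Share of rows whose field value is in the allowed set."""
    return sum(1 for r in rows if r.get(field) in allowed) / len(rows)


def timeliness(rows, field, cutoff):
    """Share of rows that arrived on or after the cutoff date."""
    return sum(1 for r in rows if r[field] >= cutoff) / len(rows)


scores = {
    "completeness": completeness(records, "email"),
    "uniqueness": uniqueness(records, "id"),
    "validity": validity(records, "country", VALID_COUNTRIES),
    "timeliness": timeliness(records, "received", date(2017, 6, 1)),
}
print(scores)
```

Which fields to score, and what threshold counts as acceptable, depends on the Critical Data Elements identified for the business problem.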
  • 23. Example: Call Center Record Unique  Integrity  Complete ? Consistent  Timely  Valid ? Is Duration = 0 important? Is 01/01/20xx a defaulted date? And how will this be linked or connected with my other data? The file appears complete, but does it cover all call centers?
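The two questions raised on this slide (is a zero duration meaningful, and is a 1 January date a default?) can be turned into review flags rather than hard rejections. This is a sketch under assumptions: the record layout, the field names `call_id`, `call_date`, and `duration_sec`, and the sample values are hypothetical.

```python
from datetime import date

# Hypothetical call center records; layout and values are assumed for illustration.
calls = [
    {"call_id": "C1", "call_date": date(2017, 3, 14), "duration_sec": 312},
    {"call_id": "C2", "call_date": date(2017, 1, 1), "duration_sec": 45},
    {"call_id": "C3", "call_date": date(2017, 3, 15), "duration_sec": 0},
]


def review_flags(call):
    """Return the reasons a record deserves a second look; empty if none."""
    flags = []
    if call["duration_sec"] == 0:
        flags.append("zero_duration")
    # 1 January is a common placeholder date, so flag it for review, not rejection.
    if (call["call_date"].month, call["call_date"].day) == (1, 1):
        flags.append("possible_default_date")
    return flags


# Keep only the records that raised at least one flag.
flagged = {}
for call in calls:
    reasons = review_flags(call)
    if reasons:
        flagged[call["call_id"]] = reasons
print(flagged)
```

Flagging instead of dropping leaves the decision with someone who knows whether, say, a zero duration means an abandoned call or a logging fault.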
  • 26. Applying Data Quality best practices: ‘New’ or ‘Extended’ Measures of Data Quality What else can we review or measure? 1) Coverage (Relevance) – How well does the data source meet the defined needs? – E.g. does it cover the relevant geography? Is it biased? 2) Continuity – Data points for all intervals or expected intervals? – E.g. sensors, weather records, call data records 3) Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference. – E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70° 4) Provenance – Where did the data originate, who gathered it, and what criteria were used to create it? – E.g. government agency, 3rd party provider, free or paid data 5) Transformation from origin – How many layers and/or changes has the data passed through? – E.g. has the original data source already been merged with two other record sources? And is the result accurate? 6) Repetition or duplication of data patterns – Data points exactly the same across multiple recording intervals or across multiple sensors. – E.g. is there tampering with sensors or call data?
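Three of these extended measures (triangulation, continuity, and repetition) lend themselves to small automated checks. The sketch below is illustrative only: the function names, the tolerance of 15 degrees, and the sample readings are assumptions, and the temperature figures are the ones from the slide's Chicago, Louisville, and Indianapolis example.

```python
def triangulation_outlier(value, neighbor_values, tolerance):
    """True when value lies more than `tolerance` outside the neighbors' range."""
    low, high = min(neighbor_values), max(neighbor_values)
    return value < low - tolerance or value > high + tolerance


def missing_intervals(observed, start, end):
    """Continuity check: expected integer intervals with no data point."""
    seen = set(observed)
    return [i for i in range(start, end + 1) if i not in seen]


def has_repeated_run(values, min_run):
    """Repetition check: any value repeated `min_run` (>= 2) times in a row."""
    run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_run:
            return True
    return False


# Chicago 30 and Louisville 32 make an Indianapolis reading of 70 implausible.
print(triangulation_outlier(70, [30, 32], tolerance=15))
# Hourly readings with hour 3 missing.
print(missing_intervals([0, 1, 2, 4, 5], start=0, end=5))
# A sensor reporting the identical value four times running.
print(has_repeated_run([21.5, 21.5, 21.5, 21.5, 22.0], min_run=4))
```

A repeated run is not proof of tampering; it is a prompt to check whether the sensor or feed is stuck.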
  • 28. Example: Twitter Feed Triangulated Continuity Provenance Coverage Usage Repeated patterns Transformation Jane Doe pulled from Twitter based on #Blackberry All items for #Blackberry in time interval appear to be included Marketing confirms these have high value Good association with current sales data All tweets appear unique within the date & vs. prior feeds This needed to include #BB and #Crackberry as well!
  • 29. Applying Data Quality best practices: Understanding Context Context is critical: Even on data that is considered “common” or “understood” such as Name or Address or Product Description To parse or standardize data to useful and usable components for additional processing To determine when and where to verify or enrich the data content To determine whether and how to match records to a given entity To identify whether to consolidate data, and if so what other data drives the consolidation
  • 30. Applying Data Quality best practices: Assessing Quality Requirements Entity data (customer, product, asset, …): Requires understanding data provenance and context Requires integrating data from multiple data sources Requires determining whether specific data should even be included Presents differences in coverage, completeness, consistency, provenance, … Comes from different points in time May contain repetitions, particularly from 3rd Party data sources May contain data at different levels of consolidation or aggregation Example record (3rd Party): Name: Robert Smith Jr; Address: 3 Davy Drive; City: Rotherham; Postal Code: S66 7EN; Phone: +44(0)1189 823606; Email: bsmith850@gmail.com
  • 31. Applying Data Quality best practices: Utilizing Data Quality functions to achieve required DQ dimensions Parse data values from unstructured fields to their correct domains Standardize values to enable higher quality matching and linkage Verify and enrich global postal addresses and geolocations Enrich data from external, third-party sources to create comprehensive, unified records Match and link like records Consolidate and aggregate to “golden” record, if appropriate, based on factors such as data source, date, … Match records that belong to the same domain (i.e., household or business) Household View: Name: Smith; Address: 3 Davy Drive; City: Rotherham; Postal Code: S66 7EN
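To show how standardization enables matching into a household view, here is a minimal sketch. It is not how Trillium implements these functions: the abbreviation table, the suffix list, the UK-style postcode rule, and every record other than the slide's "Robert Smith Jr, 3 Davy Drive, S66 7EN" are hypothetical.

```python
import re
from collections import defaultdict

NAME_SUFFIXES = {"JR", "SR", "II", "III"}  # assumed suffix list
ABBREVIATIONS = {"DR": "DRIVE", "DR.": "DRIVE", "RD": "ROAD", "ST": "STREET"}  # assumed


def standardize_postcode(raw):
    """Uppercase, drop punctuation and spaces, then put a space before the last 3 characters."""
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    return compact[:-3] + " " + compact[-3:] if len(compact) > 3 else compact


def standardize_address(raw):
    """Collapse whitespace, uppercase, and expand a few street-type abbreviations."""
    text = re.sub(r"\s+", " ", raw.strip()).upper()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split(" "))


def parse_surname(name):
    """Take the last name token after removing suffixes such as 'Jr'."""
    tokens = [t.strip(".,").upper() for t in name.split()]
    tokens = [t for t in tokens if t and t not in NAME_SUFFIXES]
    return tokens[-1]


def household_key(record):
    """Records sharing surname, address, and postcode fall into one household."""
    return (
        parse_surname(record["name"]),
        standardize_address(record["address"]),
        standardize_postcode(record["postcode"]),
    )


records = [
    {"name": "Robert Smith Jr", "address": "3 Davy Drive", "postcode": "S66 7EN"},
    {"name": "Anne Smith", "address": "3  davy dr.", "postcode": "s667en"},   # hypothetical
    {"name": "Robert Smith", "address": "14 High St", "postcode": "S66 7EN"},  # hypothetical
]

households = defaultdict(list)
for rec in records:
    households[household_key(rec)].append(rec["name"])
print(dict(households))
```

Exact-key matching like this is the simplest possible linkage; real name and address matching usually needs fuzzy comparison and verified reference data, which this sketch deliberately leaves out.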
  • 32. Applying Data Quality best practices: Example Large telco organization: “What are our customers saying about us in the marketplace? Where are the most common complaints coming from?” Issue: sparse results concentrated in one region Required: standardization, enrichment, geocoding, matching/record linkage, address verification Before Data Quality After Data Quality
  • 33. Applying Data Quality best practices: Example Large telco organization: “What are our most profitable regions on a daily basis? Which are the most profitable regions?” Issue: poor geolocation identifying wrong regions Required: parsing, standardization, address verification, enrichment, geocoding, matching/record linkage Before Data Quality After Data Quality
  • 34. Applying Data Quality best practices: Consistent processing Big Data at scale distributes data across many nodes – not necessarily with other relevant data! – Implications for joining, sorting, and matching data, whether for enrichment, verification against trusted sources, or a consolidated single view Data Quality functions must be performed in a consistent manner, no matter where actual processing takes place, how the data is segmented, and what the data volume is – Processing routines must apply same approach and logic each time – Critical to establishing, building, and maintaining trust Source: HP Analyst Briefing
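The consistency requirement above can be illustrated with a toy example: if the match key depends only on the record itself, the grouped result is the same however the data is split across nodes. This is a sketch under assumptions, written in plain Python rather than against any Big Data framework; the record fields, the sample data, and the `process` function standing in for a map-then-shuffle step are all hypothetical.

```python
from collections import defaultdict


def match_key(record):
    """Deterministic key: depends only on the record, never on the partition."""
    return (
        record["surname"].strip().upper(),
        record["postcode"].replace(" ", "").upper(),
    )


def process(partitions):
    """Key each partition independently, then merge groups by key (a toy 'shuffle')."""
    merged = defaultdict(set)
    for part in partitions:
        for rec in part:
            merged[match_key(rec)].add(rec["id"])
    return {key: sorted(ids) for key, ids in merged.items()}


# Hypothetical records with inconsistent casing and spacing.
records = [
    {"id": 1, "surname": "smith", "postcode": "S66 7EN"},
    {"id": 2, "surname": "Smith ", "postcode": "s667en"},
    {"id": 3, "surname": "jones", "postcode": "LS1 4AP"},
    {"id": 4, "surname": "SMITH", "postcode": "S66 7EN"},
]

one_node = process([records])
three_nodes = process([records[:1], records[1:3], records[3:]])
print(one_node == three_nodes)
```

The same logic applied to one partition or three yields identical groups, which is the property that lets results be trusted regardless of where processing takes place.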
  • 35. Trillium Quality for Big Data Focus on Data Quality, not the Big Data platform • Use existing Data Quality skills and expertise • No need to worry about mappers, reducers, big side or small side of joins, etc • Automatic optimization for best performance, load balancing, etc. • No changes or tuning required, even if you change execution frameworks • Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x • Run multiple execution frameworks in a single job Single GUI Execute Anywhere! Intelligent Execution - Insulate your organization from underlying complexities of Hadoop
  • 36. Bring Data Quality best practices into the Data Lake: The Syncsort/Trillium technology approach “Design once, deploy anywhere” – Visually design data quality jobs once and run anywhere (MapReduce, Spark, Linux, Unix, Windows; on premise or in the cloud) – Use-case templates to fast-track development – Test & debug locally in Windows/Linux; deploy to Big Data – Intelligent Execution dynamically optimizes data processing at run-time based on the chosen compute framework; no changes or tuning required Benefit: Significantly reduce manual data preparation – Major time sink for data scientists, architects and analysts – Risk of inconsistent or incomplete data preparation Benefit: Significantly increase trust in data – Major time sink for executives – Risk of poor data-based business decisions Single GUI Execute Anywhere!
  • 37. Data Quality remains Data Quality, even at scale “Data is useful. High-quality, well-understood, auditable data is priceless.” Ted Friedman VP Distinguished Analyst Article in CRM.com: Mar 8, 2005 “The Coming of BI Competency Centers” “Data is the new science. Big Data holds the answers. Are you asking the right questions?” Pat Gelsinger President and COO at EMC Forbes Insights, June 22, 2012 “Big Bets On Big Data”
  • 38. Questions and Next Steps For more information on Trillium Quality for Big Data, visit: trilliumsoftware.com/products/big-data Contact Info: Mike Urbonas, Director of Product Marketing, Syncsort/Trillium Software murbonas@syncsort.com https://www.linkedin.com/in/mikeu Harald Smith, Director of Product Management, Syncsort/Trillium Software harald_smith@trilliumsoftware.com https://www.linkedin.com/in/harald-smith-71028b