Data Quality - Testing
Vijaya Kokkili
Director of Quality
CommerceHub
MEEEE… Overcome the fear!!!
Gardening
Adrenaline Junkie
MEEEE… Still to come…
Agenda:
Data Quality
• World trending towards…
• Facts about data quality
• Most common business problems
• Business benefits
• What is data quality?
• Dimensions of data quality
• Definitions of dimensions
• Real time situation
• Measuring data quality
• Data profiling analysis
• When and how to conduct data profiling
Data Quality Testing
• How to test data quality
• Data quality test management
• Data quality testing challenges
• Data quality testing best practices
• Operating systems
• Mobile platforms
• Software frameworks
• Hardware
• Software
Few facts about data quality:
● Cost of poor data quality in US - $600 Billion
● Poor Data/Lack of visibility cited as #1 reason for project cost overruns
● Poor data quality costs the US Economy $3.1 Trillion a year
● Implementing data quality best practices boosts revenue by 66%
● Median Fortune 1000 company could increase revenue by $2.01 Billion if they improved usability of data by 10%
Most common business problems
• Billing and payment errors causing negative customer perceptions
• Operating expenses are inflated
• Regulatory fines are levied due to inaccurate reporting of data to government entities
• Customers and revenue are lost due to an inability to track customer interactions or to recognize high-value customers
• Disruption of service
• Flawed analytics lead to poor tactical and strategic directions
• Extra time on IT projects to reconcile data
• Delays in deploying new systems
Business benefits:
• Customer satisfaction
• Strengthens trust and collaboration between trading partners
• Increases supply chain efficiencies and cuts costs by reducing errors
• Cuts delays at point-of-sale as a result of reduced measurement errors
• Increases reliability and efficiency
• Ensures better compliance
What is data quality?
Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.
Dimensions of data quality:
● Consistency
● Accuracy
● Correctness
● Objectivity
● Timeliness
● Conciseness
● Precision
● Usefulness
● Unambiguous
● Usability
● Completeness
● Relevance
● Reliability
● Amount of data
Definitions of data quality dimensions:
• Correctness / Accuracy: The degree to which the captured data correctly describes the real-world entity.
• Consistency: This is about a single version of the truth. Consistency means data throughout the enterprise should be in sync.
• Completeness: The extent to which the expected attributes of data are provided.
• Timeliness: Getting the right data to the right person at the right time is important for business.
Definitions of data quality dimensions:
• Correctness / Accuracy:
Accuracy of data is the degree to which the captured data correctly describes the real-world entity.
Ability to draw correct conclusions from data
Business processes that match reality
E.g. of data accuracy issues:
• An incident reported with $23M when the loss was $12k
• The amount invoiced does not represent the customer's usage
Definitions of data quality dimensions:
• Consistency: This is about a single version of the truth. Consistency means data throughout the enterprise should be in sync.
Ability to trust data regardless of source
Identical information available to all processes and units
E.g. of data consistency issues:
• Mr. A defines "reprocessing" as cancel/total and Mr. B as cancel/new.
Definitions of data quality dimensions:
• Completeness: The extent to which the expected attributes of data are provided.
Data that does not leave any open questions
Ability to make a good decision based on available data
Closeness between "need to know" and what the data tells you
E.g. of data completeness issues:
• We cannot tell how many cell phone contracts Mr. X has
• A summary report includes projects that did not report status!
Definitions of data quality dimensions:
• Timeliness: Getting the right data to the right person at the right time is important for business.
Data that is available without delay
Ability to know what you need, when you need it
Smooth information flow: "Data delayed is data denied!"
E.g. of data timeliness issues:
• Receiving a "budget exceeded" SMS after you went over the limit!
Real time situation
Many database professionals face situations like:
1. Several data inconsistencies in the source, like missing records or NULL values.
2. The column chosen as the primary key is not unique throughout the table.
3. The schema design does not match the end-user requirements.
4. Any other concern with the data that should have been fixed right at the beginning.
What does it mean to fix data quality issues?
Making changes in ETL data flow packages, cleaning identified inconsistencies, etc.
A lot of rework to be done
Added costs in terms of time and effort
So…
What is the solution?
Solution
"PREVENTION IS BETTER THAN CURE"
Hence data profiling comes to the rescue
Measuring Data Quality
Profiling – Understand metadata
• A point-in-time profile shows what the data looks like now
• Automating the profile shows trends
o Alert to new/potential issues as they happen
o Potentially fix issues in near real time
Statistical process control
Automated inspection
Visibility shows process deviation
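As a rough illustration of statistical process control applied to an automated profiling metric, the sketch below derives three-sigma control limits from a baseline of daily NULL rates and flags new observations that fall outside them. The metric, the sample values, and the function names are assumptions made for illustration; they do not come from the slides.

```python
from statistics import mean, pstdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits computed from a baseline series."""
    center = mean(baseline)
    spread = pstdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(baseline, observations, sigmas=3.0):
    """Return the observations that fall outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [x for x in observations if x < lower or x > upper]


# Hypothetical daily NULL rates (percent) for one column.
baseline_null_rates = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0]
new_null_rates = [1.1, 4.5, 0.9]

# Values outside the limits signal a process deviation worth an alert.
alerts = out_of_control(baseline_null_rates, new_null_rates)
print(alerts)
```

Running the same check on every load turns a one-off profile into a trend, which is what makes near-real-time alerts possible.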
Data profiling analysis
Duplication
Pattern matching
Day of week
Character set
Reference data matching
Inter-data set comparisons
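Several of these analyses can be sketched in a few lines of Python. The records, the expected phone format, and the list of valid states below are all hypothetical; a real profile would use the organization's own reference data and rules.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "phone": "518-555-0101", "state": "NY", "order_date": date(2024, 3, 4)},
    {"id": 2, "phone": "5185550102", "state": "NY", "order_date": date(2024, 3, 5)},
    {"id": 2, "phone": "518-555-0103", "state": "XX", "order_date": date(2024, 3, 9)},
]

PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # assumed expected format
VALID_STATES = {"NY", "NJ", "CT"}                   # assumed reference data

# Duplication: identifiers that occur more than once.
id_counts = Counter(r["id"] for r in records)
duplicate_ids = sorted(k for k, n in id_counts.items() if n > 1)

# Pattern matching: values that do not fit the expected format.
bad_phones = [r["phone"] for r in records if not PHONE_PATTERN.match(r["phone"])]

# Reference data matching: values missing from the reference list.
unknown_states = [r["state"] for r in records if r["state"] not in VALID_STATES]

# Day of week: dates that fall on a weekend (Monday is 0, Sunday is 6).
weekend_dates = [
    r["order_date"].isoformat() for r in records if r["order_date"].weekday() >= 5
]

print(duplicate_ids, bad_phones, unknown_states, weekend_dates)
```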
Master data management
Create a standard for data
Distribute data so that all sources are uniform
• Names
• Addresses
• Phone numbers
• Products
Can hook into 3rd party sources
Data Governance
Central authority for data quality control
Applies information collected from data profiling uniformly across the business
Communication channels between business and IT groups
Maintenance of data quality
Data quality results from going through the data, scrubbing it, standardizing it, removing duplicate records, and enriching it.
1. Maintain complete data
2. Clean up data by standardizing using rules
3. Use algorithms to detect duplicates
4. Avoid entry of duplicate leads and contacts
5. Merge existing duplicate records
6. Use roles for security
Inconsistent data before cleaning up
Invoice    Bill no  CustomerName       SSN
Invoice 1  101      Ms Vijaya Kokkili  SSN100123
Invoice 2  204      Ms V Kokkili       SSN100123
Invoice 3  354      Ms Kokkili Vijaya  SSN100123
Invoice 4  467      Ms Vijaya K        SSN100123
Consistent data after cleaning up
Invoice    Bill no  CustomerName       SSN
Invoice 1  101      Ms Vijaya Kokkili  SSN100123
Invoice 2  204      Ms Vijaya Kokkili  SSN100123
Invoice 3  354      Ms Vijaya Kokkili  SSN100123
Invoice 4  467      Ms Vijaya Kokkili  SSN100123
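One way to get from the inconsistent invoices to the consistent ones is to replace each customer name with a standard name looked up by SSN. The sketch below assumes such a master record already exists; the `master_names` mapping and the `standardize` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical master data: one standard name per SSN.
master_names = {"SSN100123": "Ms Vijaya Kokkili"}

invoices = [
    {"bill_no": 101, "customer_name": "Ms Vijaya Kokkili", "ssn": "SSN100123"},
    {"bill_no": 204, "customer_name": "Ms V Kokkili", "ssn": "SSN100123"},
    {"bill_no": 354, "customer_name": "Ms Kokkili Vijaya", "ssn": "SSN100123"},
    {"bill_no": 467, "customer_name": "Ms Vijaya K", "ssn": "SSN100123"},
]


def standardize(rows, master):
    """Return copies of rows with customer_name replaced by the master name.

    Rows whose SSN is not in the master data keep their original name.
    """
    cleaned = []
    for row in rows:
        fixed = dict(row)
        fixed["customer_name"] = master.get(row["ssn"], row["customer_name"])
        cleaned.append(fixed)
    return cleaned


cleaned_invoices = standardize(invoices, master_names)
distinct_before = {r["customer_name"] for r in invoices}
distinct_after = {r["customer_name"] for r in cleaned_invoices}
print(len(distinct_before), len(distinct_after))
```

Working on copies keeps the original rows available for auditing what was changed.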
When and how to conduct data profiling?
Generally, data profiling is conducted in two ways:
1. Writing SQL queries on sample data extracts loaded into a database.
2. Using data profiling tools.
When to conduct data profiling?
At the discovery/requirements gathering phase
How to conduct data profiling?
Data profiling involves statistical analysis of the data at the source and the data being loaded, as well as analysis of metadata. These statistics may be used for various analysis purposes. Common examples of analyses are:
Data quality: Analyze the quality of data at the data source.
NULL values: Look out for the number of NULL values in an attribute.
Candidate keys: Analyzing the extent to which certain columns are distinct gives the developer useful information for selecting candidate keys.
Primary key selection: Check that the candidate key column does not violate the basic requirements of having no NULL values and no duplicate values.
Empty string values: A string column may contain NULL or even empty string values that may create problems later.
String length: An analysis of the longest and shortest lengths, as well as the average string length, of a string-type column can help decide what data type would be most suitable for that column.
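The SQL-based approach can be sketched with Python's built-in `sqlite3` module. The `customers` table, its columns, and the sample rows are hypothetical stand-ins for a real data extract; the query covers the NULL, empty string, candidate key, and string length checks in one pass.

```python
import sqlite3

# Load a hypothetical sample extract into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, ""), (2, None), (3, "longer.name@example.com")],
)

# Aggregates such as MIN, MAX and AVG skip NULL values.
profile = conn.execute(
    """
    SELECT
        COUNT(*)                                       AS total_rows,
        SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_emails,
        SUM(CASE WHEN email = '' THEN 1 ELSE 0 END)    AS empty_emails,
        COUNT(DISTINCT customer_id)                    AS distinct_ids,
        MIN(LENGTH(email))                             AS min_len,
        MAX(LENGTH(email))                             AS max_len,
        AVG(LENGTH(email))                             AS avg_len
    FROM customers
    """
).fetchone()

total_rows, null_emails, empty_emails, distinct_ids, min_len, max_len, avg_len = profile

# A column can only serve as a key if every row has a distinct value.
customer_id_is_unique = distinct_ids == total_rows
print(profile, customer_id_is_unique)
```

Here `customer_id` fails the uniqueness test, which is exactly the kind of finding that is cheaper to catch during requirements gathering than after the ETL is built.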
How to test for Data quality?
• Row Count: Discrepancy in record counts between source and target
• Completeness: All data at the source is present at the target
• Consistency: Ensure that source and target don't contain conflicting facts
• Validity: Degree of conformance of data to its domain and business values
• Redundancy: Physical and logical duplicates
• Referential Integrity: Orphan records in targets with no corresponding parent records
• Domain Integrity: List of valid/invalid values that are allowed, along with ranges, lookups, etc.
• Accuracy: Degree to which data reflects the real-world objects
• Usability: Describes the relevance and meaning of data
• Timeliness: Describes availability of data as per SLA
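Four of these checks (row count, completeness, redundancy, and referential integrity) can be expressed as simple queries. The sketch below uses Python's `sqlite3` module with hypothetical source and target tables; table names, columns, and rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE target_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE target_customers (customer_id INTEGER);
    INSERT INTO source_orders VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO target_orders VALUES (1, 10), (2, 10), (2, 10), (4, 30);
    INSERT INTO target_customers VALUES (10), (20);
    """
)


def column(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Row count: difference between target and source record counts.
row_count_diff = (
    column("SELECT COUNT(*) FROM target_orders")[0]
    - column("SELECT COUNT(*) FROM source_orders")[0]
)

# Completeness: source orders that never reached the target.
missing_in_target = column(
    "SELECT order_id FROM source_orders "
    "WHERE order_id NOT IN (SELECT order_id FROM target_orders) "
    "ORDER BY order_id"
)

# Redundancy: rows that appear more than once in the target.
duplicate_orders = column(
    "SELECT order_id FROM target_orders "
    "GROUP BY order_id, customer_id HAVING COUNT(*) > 1"
)

# Referential integrity: target orders whose customer has no parent record.
orphan_orders = column(
    "SELECT o.order_id FROM target_orders o "
    "LEFT JOIN target_customers c ON c.customer_id = o.customer_id "
    "WHERE c.customer_id IS NULL"
)

print(row_count_diff, missing_in_target, duplicate_orders, orphan_orders)
```

Each result that is not zero or empty is a candidate defect to log and share with the business.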
Data quality test management
Test planning
Requirements:
• BRD
• FSD
• Test Plan
Test design
Requirements:
• Test scenarios
• Test cases
• Automated
Test execution
Requirements:
• Executed in test cycles
• Test results/bugs are shared with business
• Prioritize
Test monitoring
Requirements:
• Collect metrics
• Observe trend
Data quality testing challenges
• Lack of tools
• Lack of domain knowledge
• Changing requirements
• Poor planning for data quality in initial phase of the application
Data quality testing best practices
• Understand user business
• Plan early in Design and testing phase
• Be proactive when it comes to data growth/trending
• Don't assume! Understand data!
Q & A
@vkokkili
vkokkili@gmail.com

Data Quality

  • 1. Data Quality - Testing Vijaya Kokkili Director of Quality CommerceHub
  • 2. MEEEE… :) Overcome the fear!!! Gardening Adrenaline Junkie
  • 4. Agenda: Data Quality: World trending towards…; Facts about data quality; Most common business problems; Business benefits; What is data quality?; Dimensions of data quality; Definitions of dimensions; Real time situation; Measuring data quality; Data profiling analysis; When and how to conduct data profiling. Data Quality Testing: How to test data quality; Data quality test management; Data quality testing challenges; Data quality testing best practices.
  • 5. • Operating systems • Mobile platforms • Software frameworks • Hardware • Software
  • 6.
  • 7. Few facts about data quality: ● Cost of poor data quality in US - $600 Billion ● Poor Data/Lack of visibility cited as #1 reason for project cost overruns ● Poor data quality costs the US Economy $3.1 Trillion a year ● Implementing data quality best practices boosts revenue by 66% ● Median Fortune 1000 company could increase revenue by $2.01 Billion if they improved usability of data by 10%
  • 8. Most common business problems • Billing and payment errors causing negative customer perceptions • Operating expenses are inflated • Regulatory fines are levied due to inaccurate reporting of data to government entities • Customers and revenue are lost due to an inability to track customer interactions or to recognize high-value customers • Disruption of service • Flawed analytics lead to poor tactical and strategic directions • Extra time on IT projects to reconcile data • Delays in deploying new systems
  • 9.
  • 10. Business benefits: • Customer satisfaction • Strengthens trust and collaboration between trading partners • Increases supply chain efficiencies and cuts costs by reducing errors • Cuts delays at point-of-sale as a result of reduced measurement errors • Increases reliability and efficiency • Ensures better compliance
  • 11. What is data quality? Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.
  • 12. Dimensions of data quality: ● Consistency ● Accuracy ● Correctness ● Objectivity ● Timeliness ● Conciseness ● Precision ● Usefulness ● Unambiguous ● Usability ● Completeness ● Relevance ● Reliability ● Amount of data
  • 13. Definitions of data quality dimensions: • Correctness / Accuracy: Accuracy of data is the degree to which the captured data correctly describes the real-world entity. • Consistency: This is about the single version of truth. Consistency means data throughout the enterprise should be in sync with each other. • Completeness: It is the extent to which the expected attributes of data are provided. • Timeliness: Right data to the right person at the right time is important for business.
  • 14. Definitions of data quality dimensions: • Correctness / Accuracy: Accuracy of data is the degree to which the captured data correctly describes the real-world entity. Ability to draw correct conclusions from data. Business processes that match reality. E.g. of data accuracy issues: • An incident reported with $23M when the loss was $12k • The amount invoiced does not represent the customer's usage
  • 15. Definitions of data quality dimensions: • Consistency: This is about the single version of truth. Consistency means data throughout the enterprise should be in sync with each other. Ability to trust data regardless of source. Identical information available to all processes and units. E.g. of data consistency issues: • Mr. A defines "reprocessing" as cancel/total and Mr. B as cancel/new.
  • 16. Definitions of data quality dimensions: • Completeness: It is the extent to which the expected attributes of data are provided. Data that does not leave any open questions. Ability to make a good decision based on available data. Closeness between "need to know" and what data tells you. E.g. of data completeness issues: • We cannot tell how many cell phone contracts Mr. X has • A summary report includes projects that did not report status!
  • 17. Definitions of data quality dimensions: • Timeliness: Right data to the right person at the right time is important for business. Data that is available without delay. Ability to know what you need, when you need it. Smooth information flow: "Data delayed is data denied!" E.g. of data timeliness issues: • Receiving a "budget exceeded" SMS after you went over the limit!
  • 18. Real time situation Many database professionals face situations like: 1. Several data inconsistencies in the source, like missing records or NULL values. 2. The column they chose to be the primary key column is not unique throughout the table. 3. Schema design is not coherent with the end user requirement. 4. Any other concern with the data that should have been fixed right at the beginning.
  • 19. What does it mean to fix data quality issues? Make changes in ETL data flow packages, cleaning identified inconsistencies, etc. Lot of re-work to be done. Added costs in terms of time and effort. So… What is the solution???
  • 20. Solution "PREVENTION IS BETTER THAN CURE" Hence data profiling comes to the rescue
  • 21. Measuring Data Quality Profiling – Understand metadata • Point of time shows what data looks like now • Automating shows trends o Alert to new/potential issues as they happen o Potentially fix issues in near real time
  • 22. Statistical process control Automated inspection Visibility shows process deviation
  • 23. Data profiling analysis Duplication Pattern matching Day of week Character set Reference data matching Inter-data set comparisons
  • 24. Master data management Create a standard for data Distribute data so that all sources are uniform • Names • Addresses • Phone numbers • Products Can hook into 3rd party sources
  • 25. Data Governance Central authority for data quality control Applies information collected from data profiling uniformly across the business Communication channels between business and IT groups
  • 26. Maintenance of data quality Data quality results from the process of going through the data and scrubbing it, standardizing it, and removing duplicate records, as well as doing some of the data enrichment. 1. Maintain complete data 2. Clean up data by standardizing using rules 3. Using algorithms to detect duplicates 4. Avoid entry of duplicate leads and contacts 5. Merge existing duplicate records 6. Use roles for security
  • 27. Inconsistent data before cleaning up: Invoice 1: Bill no 101, CustomerName Ms Vijaya Kokkili, SSN SSN100123. Invoice 2: Bill no 204, CustomerName Ms V Kokkili, SSN SSN100123. Invoice 3: Bill no 354, CustomerName Ms Kokkili Vijaya, SSN SSN100123. Invoice 4: Bill no 467, CustomerName Ms Vijaya K, SSN SSN100123.
  • 28. Consistent data after cleaning up: Invoice 1: Bill no 101, CustomerName Ms Vijaya Kokkili, SSN SSN100123. Invoice 2: Bill no 204, CustomerName Ms Vijaya Kokkili, SSN SSN100123. Invoice 3: Bill no 354, CustomerName Ms Vijaya Kokkili, SSN SSN100123. Invoice 4: Bill no 467, CustomerName Ms Vijaya Kokkili, SSN SSN100123.
  • 29. When and how to conduct data profiling? Generally, data profiling is conducted in two ways: 1. Writing SQL queries on sample data extracts put into a database. 2. Using data profiling tools. When to conduct data profiling? At the discovery/requirements gathering phase.
  • 30. How to conduct data profiling? Data profiling involves statistical analysis of the data at source and the data being loaded, as well as analysis of metadata. These statistics may be used for various analysis purposes. Common examples of analyses to be done are: Data quality: Analyze the quality of data at the data source. NULL values: Look out for the number of NULL values in an attribute. Candidate keys: Analysis of the extent to which certain columns are distinct will give the developer useful information with respect to the selection of candidate keys. Primary key selection: Check that the candidate key column does not violate the basic requirements of having no NULL values or duplicate values. Empty string values: A string column may contain NULL or even empty string values that may create problems later. String length: An analysis of the largest and shortest possible length as well as the average string length of a string-type column can help us decide what data type would be most suitable for the said column.
  • 31. How to test for Data quality? Row Count: Discrepancy in records count at source & target. Completeness: When all data at source is present at target. Consistency: Ensure that source & target don't contain conflicting facts. Validity: Degree of conformance of data to its domain and business values. Redundancy: Physical and logical duplicates. Referential Integrity: Orphan records in targets when no corresponding parent records. Domain Integrity: List of valid/invalid values that are allowed along with ranges, look up etc. Accuracy: Degree to which data reflects the real world objects. Usability: Describes the relevance & meaning of data. Timeliness: Describes availability of data as per SLA.
  • 32. Data quality test management Test planning: Requirements: • BRD • FSD • Test Plan. Test design: Requirements: • Test scenarios • Test cases • Automated. Test execution: Requirements: • Executed in test cycles • Test results/bugs are shared with business • Prioritize. Test monitoring: Requirements: • Collect metrics • Observe trend.
  • 33. Data quality testing challenges • Lack of tools • Lack of domain knowledge • Changing requirements • Poor planning for data quality in initial phase of the application
  • 34. Data quality testing best practices • Understand user business • Plan early in Design and testing phase • Be proactive when it comes to data growth/trending • Don't assume! Understand data!
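
The source-to-target checks listed on slide 31 can be sketched as a handful of SQL queries. The example below is only an illustration, not the presenter's method: it uses Python's built-in sqlite3 module, and every table name, column name, allowed status value, and data row (src_orders, tgt_orders, tgt_customers, and so on) is a hypothetical stand-in for a real source and target. Each check returns the number of offending rows, so 0 means the check passed.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny source and target (all data invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER, status TEXT, amount REAL);
        CREATE TABLE tgt_orders (order_id INTEGER, customer_id INTEGER, status TEXT, amount REAL);
        CREATE TABLE tgt_customers (customer_id INTEGER);
        INSERT INTO src_orders VALUES
            (1, 10, 'NEW', 25.0), (2, 10, 'SHIPPED', 40.0),
            (3, 11, 'NEW', 15.0), (4, 11, 'CANCELLED', 5.0);
        INSERT INTO tgt_customers VALUES (10), (11);
        INSERT INTO tgt_orders VALUES
            (1, 10, 'NEW', 25.0),
            (2, 10, 'SHIPPED', 40.0),
            (2, 10, 'SHIPPED', 40.0),   -- loaded twice
            (3, 99, 'PENDING', -15.0);  -- unknown customer, bad status and amount
    """)
    return conn


def run_quality_checks(conn):
    """Return {check name: number of offending rows}; 0 means the check passed."""
    checks = {
        # Row count: absolute difference between source and target counts.
        "row_count_diff": """
            SELECT ABS((SELECT COUNT(*) FROM src_orders)
                     - (SELECT COUNT(*) FROM tgt_orders))""",
        # Completeness: source rows that never arrived at the target.
        "missing_in_target": """
            SELECT COUNT(*) FROM src_orders s
            LEFT JOIN tgt_orders t ON t.order_id = s.order_id
            WHERE t.order_id IS NULL""",
        # Redundancy: extra copies of the same order_id in the target.
        "duplicate_rows": """
            SELECT COALESCE(SUM(n - 1), 0) FROM (
                SELECT COUNT(*) AS n FROM tgt_orders
                GROUP BY order_id HAVING COUNT(*) > 1)""",
        # Referential integrity: target orders without a parent customer.
        "orphan_rows": """
            SELECT COUNT(*) FROM tgt_orders o
            LEFT JOIN tgt_customers c ON c.customer_id = o.customer_id
            WHERE c.customer_id IS NULL""",
        # Domain integrity: status outside the allowed list or a negative amount.
        "invalid_domain": """
            SELECT COUNT(*) FROM tgt_orders
            WHERE status NOT IN ('NEW', 'SHIPPED', 'CANCELLED') OR amount < 0""",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


if __name__ == "__main__":
    for name, offending in run_quality_checks(build_demo_db()).items():
        print(f"{name}: {offending}")
```

In this invented data set the source and target both hold four rows, so the row count check passes even though the target has a missing order, a duplicate, an orphan, and an invalid value. That is why the row count is only the first of the checks on the slide and not a substitute for the others.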

Editor's Notes

  1. Today's world is one of heterogeneity. We have different technologies. We operate on different platforms. We have a large amount of data being generated every day in all sorts of organizations and enterprises.
  2. Fitbit Medical Life everyday routine
  3. Facts of data quality: ● Cost of poor data quality in US - $600 Billion ● Poor Data/Lack of visibility cited as #1 reason for project cost overruns ● Poor data quality costs the US Economy $3.1 Trillion a year ● Implementing data quality best practices boosts revenue by 66% ● Median Fortune 1000 company could increase revenue by $2.01 Billion if they improved usability of data by 10%. And we do have problems with data. Problems like: duplicated, inconsistent, ambiguous, incomplete. So there is a need to collect data in one place and clean up the data.
  4. Businesses are increasingly only as good as their data. High quality data is essential for capturing the interest of consumers and driving online sales.
  5. Increases customer satisfaction by ensuring the accuracy of product information – ingredients, prices, nutritional information; strengthens trust and collaboration between trading partners; increases supply chain efficiencies and cuts costs by reducing errors; cuts delays at point-of-sale as a result of reduced measurement errors; increases the reliability and efficiency of product transportation and delivery to stores and warehouses; ensures better compliance with industry standards and regulations.
  6. Why does data quality matter? Good data is your most valuable asset, and bad data can seriously harm your business and credibility… 1. What have you missed? 2. When things go wrong. 3. Making confident decisions. Is the data trustworthy and credible information?
  7. Accuracy: What does accuracy stand for? Good fit between data and reality; ability to draw correct conclusions from data; business processes that match reality. E.g. of data accuracy issues: An incident reported with $23M when the loss was $12k; the amount invoiced does not represent the customer's usage. Consistency stands for: Data in harmony across the company; ability to trust data regardless of source; identical information available to all processes and units. E.g.: Mr. A defines "reprocessing" as cancel/total and Mr. B as cancel/new. Completeness stands for: Data that does not leave any open questions; ability to make a good decision based on available data; closeness between "need to know" and what data tells you. E.g.: We cannot tell how many cell phone contracts Mr. X has; a summary report includes projects that did not report status! Timeliness stands for: Data that is available without delay; ability to know what you need, when you need it; smooth information flow: data delayed is data denied!
  8. Accuracy: What does accuracy stand for? Good fit between data and reality; ability to draw correct conclusions from data; business processes that match reality. E.g. of data accuracy issues: An incident reported with $23M when the loss was $12k; the amount invoiced does not represent the customer's usage.
  9. Consistency stands for: Data in harmony across the company; ability to trust data regardless of source; identical information available to all processes and units. E.g.: Mr. A defines "reprocessing" as cancel/total and Mr. B as cancel/new.
  10. Completeness stands for: Data that does not leave any open questions; ability to make a good decision based on available data; closeness between "need to know" and what data tells you. E.g.: We cannot tell how many cell phone contracts Mr. X has; a summary report includes projects that did not report status!
  11. Timeliness stands for: Data that is available without delay; ability to know what you need, when you need it; smooth information flow: data delayed is data denied!
  12. What is data profiling? It is the process of statistically examining and analyzing the content in a data source, and hence collecting information about the data. It consists of techniques used to analyze the data we have for accuracy and completeness. 1. Data profiling helps us make a thorough assessment of data quality. 2. It assists the discovery of anomalies in data. 3. It helps us understand content, structure, relationships, etc. about the data in the data source we are analyzing. 4. It helps us know whether the existing data can be applied to other areas or purposes. 5. It helps us understand the various issues/challenges we may face in a database project much before the actual work begins. This enables us to make early decisions and act accordingly. 6. It is also used to assess and validate metadata.
  13. It is important for QA to make sure these requirements are provided upfront.
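
The column-level profiling described in note 12 and on slide 30 (NULL values, empty strings, candidate keys, string length) can be sketched with plain SQL over a sample extract, which is the first of the two approaches named on slide 29. This is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module, the customers table and its rows are invented, and profile_column is a hypothetical helper rather than part of any profiling tool.

```python
import sqlite3


def profile_column(conn, table, column):
    """Collect basic statistics for one text column.

    The table and column names are interpolated into the SQL, so they must
    come from trusted code, never from user input.
    """
    sql = f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               SUM(CASE WHEN {column} = '' THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column}),
               MIN(LENGTH({column})),
               MAX(LENGTH({column})),
               AVG(LENGTH({column}))
        FROM {table}"""
    names = ["total_rows", "null_count", "empty_count", "distinct_count",
             "min_len", "max_len", "avg_len"]
    stats = dict(zip(names, conn.execute(sql).fetchone()))
    # A candidate key has no NULLs and as many distinct values as rows.
    stats["is_candidate_key"] = (
        stats["null_count"] == 0
        and stats["distinct_count"] == stats["total_rows"]
    )
    return stats


def build_sample_extract():
    """Load a tiny invented extract into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("C001", "a@example.com"),
         ("C002", None),              # NULL value
         ("C003", ""),                # empty string
         ("C004", "a@example.com")],  # repeated value
    )
    return conn


if __name__ == "__main__":
    conn = build_sample_extract()
    for col in ("customer_id", "email"):
        print(col, profile_column(conn, "customers", col))
```

On this sample, customer_id qualifies as a candidate key, while email does not: it has one NULL, one empty string, and a repeated value. Note that COUNT(DISTINCT ...) and LENGTH ignore NULLs, so NULL and empty-string counts are reported separately, matching the distinction slide 30 draws between the two.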