Jim Harris
               Blogger‐in‐Chief
             www.ocdqblog.com
Jim Harris
             Digitally signed by Jim Harris
             DN: cn=Jim Harris, o=Obsessive-Compulsive Data Quality (OCDQ), ou, email=jim.harris@ocdqblog.com, c=US
             Date: 2010.03.04 10:55:20 -06'00'



     E‐mail
     jim.harris@ocdqblog.com

     Twitter
     twitter.com/ocdqblog

     LinkedIn
     linkedin.com/in/jimharris




Adventures in Data Profiling     Copyright © 2010, Jim Harris. All rights reserved.   2
Let the Adventures Begin . . .
 This will be a vendor‐neutral presentation:

     Focusing on general methodology of data profiling 
     and common functionality of data profiling tools 

     Discussing how a data profiling tool helps automate 
     some of the work needed for preliminary data analysis

     Reviewing fictional data and results produced by a 
     fictional data profiling tool to illustrate basic concepts 


Understanding Your Data
     Understanding your data is essential to using it 
     effectively and improving its quality

     You need a reality check for the perceptions and 
     assumptions you have about the quality of your data 

     You need to prepare meaningful questions to ask your 
     business analysts and subject matter experts

     There is simply no substitute for data analysis 

Profiling Your Data
 Data profiling includes many types of analysis such as:

     Verify data matches the metadata that describes it

     Identify representations for the absence of data

     Identify potential default and invalid values

     Check data formats for inconsistencies

     Assess domain, structural, and relational integrity

Getting Your Data Freq On
     Data profiling tools can help you by automating some 
     of the grunt work needed to begin your data analysis

     One of their basic features is the ability to generate 
     statistical summaries and frequency distributions for 
     the unique values and formats found within your fields

  Therefore, I like to refer to using a data profiling tool as:
              “Getting Your Data Freq On”

Let Me Count The Ways


     NULL – record count of NULL values

     Missing – record count of Missing values (i.e., non‐NULL absence of data)

     Actual – record count of Actual values (i.e., non‐NULL and non‐Missing)

     Completeness – percentage calculated as Actual divided by total record count

     Cardinality – count of the number of distinct actual values

     Uniqueness – percentage calculated as Cardinality divided by total record count

     Distinctness – percentage calculated as Cardinality divided by Actual
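
The counts and percentages defined on this slide can be sketched in a few lines of Python. This is only an illustration, not the output of any particular profiling tool: the function name `profile_field`, the `MISSING_MARKERS` set, and the sample data are all assumptions made up for the example, and which strings count as "Missing" depends entirely on your own data.

```python
# Hypothetical placeholder strings treated as "Missing" (non-NULL absence of data).
# This set is an assumption for the example; adjust it to your own data.
MISSING_MARKERS = {"", "N/A", "UNKNOWN"}


def is_missing(value):
    """True for a non-NULL value that only stands in for absent data."""
    return value is not None and str(value).strip().upper() in MISSING_MARKERS


def profile_field(values):
    """Compute the slide's counts and percentages for one field."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    missing_count = sum(1 for v in values if is_missing(v))
    actual = [v for v in values if v is not None and not is_missing(v)]
    cardinality = len(set(actual))
    return {
        "total": total,
        "null": null_count,
        "missing": missing_count,
        "actual": len(actual),
        "completeness": 100.0 * len(actual) / total if total else 0.0,
        "cardinality": cardinality,
        "uniqueness": 100.0 * cardinality / total if total else 0.0,
        "distinctness": 100.0 * cardinality / len(actual) if actual else 0.0,
    }


# Fictional field with 10 records.
sample = ["A", "B", "B", None, "", "N/A", "C", "C", "C", "D"]
stats = profile_field(sample)
print(stats)
```

Note that Uniqueness and Distinctness share a numerator (Cardinality) and differ only in the denominator: total record count versus Actual.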
You Uniquely Complete Me
                                   Completeness and 
                                   Uniqueness are useful in 
                                   evaluating potential key 
                                   fields and especially a 
                                   single primary key, 
                                   which should be both: 
                                         100% Complete
                                         100% Unique
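
As a rough sketch of that check, a field qualifies as a potential single primary key only when every record has an actual value and no value repeats. The function name `is_candidate_key` and the sample fields are hypothetical, and for simplicity only NULL and blank strings are treated as absent here.

```python
def is_candidate_key(values):
    """True when the field is both 100% Complete and 100% Unique."""
    total = len(values)
    # Simplifying assumption: only None and blank strings count as absent data.
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    complete = total > 0 and len(actual) == total     # 100% Complete
    unique = total > 0 and len(set(actual)) == total  # 100% Unique
    return complete and unique


# Fictional fields from a four-record table.
customer_id = [101, 102, 103, 104]
tax_id = ["111", "222", None, "222"]

print(is_candidate_key(customer_id))  # complete and unique
print(is_candidate_key(tax_id))       # has a NULL and a repeated value
```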




It’s a Distinct Possibility



                                 Distinctness can be useful 
                                 in evaluating the potential
                                 for duplicate records
                                 < 100% Distinct means some 
                                 distinct actual values occur on 
                                 more than one record
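
A minimal sketch of following up on a field that is less than 100% Distinct: list the actual values that occur on more than one record. The function name `repeated_values` and the sample data are assumptions for illustration; a repeated value is only a hint of potential duplicates, not proof of them.

```python
from collections import Counter


def repeated_values(values):
    """Return {value: count} for actual values found on more than one record."""
    # Simplifying assumption: only None and blank strings count as absent data.
    counts = Counter(v for v in values if v is not None and str(v).strip() != "")
    return {value: n for value, n in counts.items() if n > 1}


# Fictional e-mail field.
emails = ["a@example.com", "b@example.com", "a@example.com", None, "c@example.com"]
repeats = repeated_values(emails)
print(repeats)
```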
Gimme the lo down, Drill‐down




Freq’ing Distribution of Values



                                        Frequency distribution of 
                                        values is useful for fields 
                                        with a low cardinality
                                        Extremely low cardinality 
                                        might be an indication of 
                                        default values
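
For a low-cardinality field, a frequency distribution of values can be sketched with the standard library's `collections.Counter`. The function name `value_frequencies` and the fictional code field below are assumptions for the example.

```python
from collections import Counter


def value_frequencies(values):
    """Return (value, count, percent) rows, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, n, 100.0 * n / total) for value, n in counts.most_common()]


# Fictional low-cardinality field: only three distinct values in ten records.
gender_code = ["M", "F", "F", "U", "F", "M", "U", "U", "U", "U"]
rows = value_frequencies(gender_code)
for value, n, pct in rows:
    print(f"{value!r:>5} {n:>3} {pct:5.1f}%")
```

In this made-up data, one value covering half the records is the kind of result worth asking about, since it might be a default value.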

Reviewing the Top N List




Reviewing the Top N most 
frequently occurring values
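
When cardinality is too high to read the full distribution, a Top N list can be sketched with `Counter.most_common(n)`. The function name `top_n` and the sample birth dates are hypothetical.

```python
from collections import Counter


def top_n(values, n=5):
    """Return the n most frequently occurring values with their record counts."""
    return Counter(values).most_common(n)


# Fictional birth date field.
birth_dates = [
    "1900-01-01", "1975-06-14", "1900-01-01",
    "1982-11-30", "1900-01-01", "1975-06-14",
]
top = top_n(birth_dates, n=2)
print(top)
```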
Freq’ing Distribution of Formats




Frequency distribution of formats is useful for fields having 
both a high cardinality and free‐form values
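
One way to sketch a frequency distribution of formats is to mask each value and count the masks. The masking convention used here (letters become `A`, digits become `9`, everything else is kept) is an assumption for illustration, not something the slides specify, and the names `to_format` and `format_frequencies` are hypothetical.

```python
from collections import Counter


def to_format(value):
    """Mask a value: letters become 'A', digits become '9', others stay as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )


def format_frequencies(values):
    """Return (format, count) pairs, most frequent format first."""
    return Counter(to_format(v) for v in values).most_common()


# Fictional free-form telephone field.
phones = ["555-123-4567", "(555) 123-4567", "555-987-6543", "5551234567", "call me"]
formats = format_frequencies(phones)
for mask, n in formats:
    print(f"{mask:<16} {n}")
```

Five distinct values collapse into four formats here, which is what makes the format view readable for high-cardinality, free-form fields.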
Unlocking the Combination




Combination of values 
and formats can help with 
unlocking the mystery of 
more complex fields
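
Combining the two views can be sketched by grouping the actual values under the format each one matches, so that an unexpected format can be inspected alongside the values that produced it. The masking rule (letters to `A`, digits to `9`), the function name `values_by_format`, and the sample data are all assumptions for the example.

```python
from collections import defaultdict


def values_by_format(values):
    """Group values under their format mask (letters -> 'A', digits -> '9')."""
    groups = defaultdict(list)
    for v in values:
        mask = "".join(
            "A" if ch.isalpha() else "9" if ch.isdigit() else ch
            for ch in v
        )
        groups[mask].append(v)
    return dict(groups)


# Fictional postal code field mixing several conventions.
postal = ["50309", "50309-1234", "K1A 0B1", "5030", "50310"]
groups = values_by_format(postal)
for mask, members in groups.items():
    print(f"{mask:<12} {members}")
```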

. . . the Adventures Conclude
What can your analysis of the data alone tell you about it?

    Understand your data better by first looking at it from a 
    starting point of blissful ignorance and curiosity

    A tool can help automate some of the grunt work, but 
    the actual data analysis cannot be automated

    Your analytical goal is not to try to find answers, but to 
    discover the right questions

Adventures in Data Profiling      Copyright © 2010, Jim Harris. All rights reserved.   15

More Related Content

What's hot

Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data Challenge
Stefan Kühn
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architecture
anicewick
 
Sound Data Quality for CRM
Sound Data Quality for CRMSound Data Quality for CRM
Sound Data Quality for CRMDivya Malik
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practices
Blaise Cheuteu
 
Data Quality Technical Architecture
Data Quality Technical ArchitectureData Quality Technical Architecture
Data Quality Technical ArchitectureHarshendu Desai
 
Lecture 22
Lecture 22Lecture 22
Lecture 22
Shani729
 
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
University of Twente
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
Dr. C.V. Suresh Babu
 
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data QualityBig Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
BigDataExpo
 
Lecture 21
Lecture 21Lecture 21
Lecture 21
Shani729
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Denny Lee
 
Lecture 23
Lecture 23Lecture 23
Lecture 23
Shani729
 
Tamr overview
Tamr overviewTamr overview
Tamr overview
Meg Vorland
 
Data analytics
Data analyticsData analytics
Foundation of data quality
Foundation of data qualityFoundation of data quality
Foundation of data quality
Khaled Mosharraf
 
Enterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareEnterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for Healthcare
DATA360US
 
Data analytics
Data analyticsData analytics
Data analytics
Dr.Bhuvaneswari Velumani
 
Machine Learning and Multi Drug Resistant(MDR) Infections case study
Machine Learning and Multi Drug Resistant(MDR) Infections case studyMachine Learning and Multi Drug Resistant(MDR) Infections case study
Machine Learning and Multi Drug Resistant(MDR) Infections case study
AlgoAnalytics Financial Consultancy Pvt. Ltd.
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management Basic
Khaled Mosharraf
 

What's hot (20)

Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data Challenge
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architecture
 
Sound Data Quality for CRM
Sound Data Quality for CRMSound Data Quality for CRM
Sound Data Quality for CRM
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practices
 
Data Quality Technical Architecture
Data Quality Technical ArchitectureData Quality Technical Architecture
Data Quality Technical Architecture
 
Lecture 22
Lecture 22Lecture 22
Lecture 22
 
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp...
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data QualityBig Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
 
Lecture 21
Lecture 21Lecture 21
Lecture 21
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
 
Lecture 23
Lecture 23Lecture 23
Lecture 23
 
Tamr overview
Tamr overviewTamr overview
Tamr overview
 
Data analytics
Data analyticsData analytics
Data analytics
 
Foundation of data quality
Foundation of data qualityFoundation of data quality
Foundation of data quality
 
Enterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareEnterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for Healthcare
 
Data analytics
Data analyticsData analytics
Data analytics
 
Machine Learning and Multi Drug Resistant(MDR) Infections case study
Machine Learning and Multi Drug Resistant(MDR) Infections case studyMachine Learning and Multi Drug Resistant(MDR) Infections case study
Machine Learning and Multi Drug Resistant(MDR) Infections case study
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management Basic
 

Viewers also liked

Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
Shailja Khurana
 
2007 Tidc India Profiling
2007 Tidc India Profiling2007 Tidc India Profiling
2007 Tidc India Profilingdanrinkes
 
Marketing Data Utilization
Marketing Data UtilizationMarketing Data Utilization
Marketing Data Utilization
CRMIT
 
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data ProfilingUncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Josiah Renaudin
 
NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010
Jim Stafford
 
Bitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor QualityBitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor Quality
RSky215
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
Telling Visual Stories About Data
Telling Visual Stories About DataTelling Visual Stories About Data
Telling Visual Stories About Data
Congressional Budget Office
 
Exposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of PerformanceExposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of Performance
Juran Global
 
Cost of poor quality
Cost of  poor qualityCost of  poor quality
Cost of poor quality
Abdullah Sasy
 
Cost of poor quality presentation5
Cost of poor quality presentation5Cost of poor quality presentation5
Cost of poor quality presentation5Imran Jamil
 
Cost of-poor-quality - juran institute
Cost of-poor-quality - juran instituteCost of-poor-quality - juran institute
Cost of-poor-quality - juran institute
Manish Chaurasia
 
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with DataFifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
Amit Kapoor
 
Quality is a cost
Quality is a costQuality is a cost
Quality is a costbatch18
 
Crafting Visual Stories with Data
Crafting Visual Stories with DataCrafting Visual Stories with Data
Crafting Visual Stories with Data
Amit Kapoor
 
Big Data Profiling
Big Data Profiling Big Data Profiling
Big Data Profiling
eXascale Infolab
 
Cost of Poor quality
Cost of  Poor qualityCost of  Poor quality
Cost of Poor quality
Raghvendra Rangaswamy Gopal
 

Viewers also liked (20)

Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
2007 Tidc India Profiling
2007 Tidc India Profiling2007 Tidc India Profiling
2007 Tidc India Profiling
 
Marketing Data Utilization
Marketing Data UtilizationMarketing Data Utilization
Marketing Data Utilization
 
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data ProfilingUncover Untold Stories in Your Data: A Deep Dive on Data Profiling
Uncover Untold Stories in Your Data: A Deep Dive on Data Profiling
 
NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010NCDM Datamining Case Study 2010
NCDM Datamining Case Study 2010
 
Bitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor QualityBitcoin, Transaction Fees and The Cost of Poor Quality
Bitcoin, Transaction Fees and The Cost of Poor Quality
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Telling Visual Stories About Data
Telling Visual Stories About DataTelling Visual Stories About Data
Telling Visual Stories About Data
 
Exposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of PerformanceExposing Your Hidden Costs of Performance
Exposing Your Hidden Costs of Performance
 
Cost of poor quality
Cost of  poor qualityCost of  poor quality
Cost of poor quality
 
Cost of poor quality presentation5
Cost of poor quality presentation5Cost of poor quality presentation5
Cost of poor quality presentation5
 
Cost of-poor-quality - juran institute
Cost of-poor-quality - juran instituteCost of-poor-quality - juran institute
Cost of-poor-quality - juran institute
 
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with DataFifth Elephant 2014 talk - Crafting Visual Stories with Data
Fifth Elephant 2014 talk - Crafting Visual Stories with Data
 
Quality is a cost
Quality is a costQuality is a cost
Quality is a cost
 
QUALITY & COST
QUALITY & COSTQUALITY & COST
QUALITY & COST
 
Crafting Visual Stories with Data
Crafting Visual Stories with DataCrafting Visual Stories with Data
Crafting Visual Stories with Data
 
Big Data Profiling
Big Data Profiling Big Data Profiling
Big Data Profiling
 
Cost of Poor quality
Cost of  Poor qualityCost of  Poor quality
Cost of Poor quality
 
Cost of quality
Cost of qualityCost of quality
Cost of quality
 
Cost of quality
Cost of qualityCost of quality
Cost of quality
 

Similar to Adventures in Data Profiling

Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons LearnedPrivacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Krishnaram Kenthapadi
 
IDOL presentation
IDOL presentationIDOL presentation
IDOL presentation
Andrey Karpov
 
Analytics At Work T. Davenport
Analytics At Work T. DavenportAnalytics At Work T. Davenport
Analytics At Work T. DavenportSaurabh Shah
 
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
SAS Institute India Pvt. Ltd
 
Information Management and Analytics
Information Management and Analytics Information Management and Analytics
Information Management and Analytics AKAGroup
 
Big Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The OpportunityBig Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The OpportunityJaime Fitzgerald
 
Week2day2 communicating data for impact
Week2day2 communicating data for impactWeek2day2 communicating data for impact
Week2day2 communicating data for impact
Nishant Kumar
 
Why does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docxWhy does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docx
franknwest27899
 
Big Data v2.pptx
Big Data v2.pptxBig Data v2.pptx
Big Data v2.pptx
Pecific University
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
Intellspot
 
How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...
B2B Marketing
 
Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018
Amit Sharma
 
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data science
Thinkful
 
For the Love of Big Data
For the Love of Big DataFor the Love of Big Data
For the Love of Big Data
Robert Sutor
 
4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models
Innodata, Inc
 
HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016
Andrey Karpov
 
Healthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & AnalyticsHealthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & Analytics
Dale Sanders
 
Dark data
Dark dataDark data
Dark data
Amir Sedighi
 
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Gianluca Tarasconi
 

Similar to Adventures in Data Profiling (20)

Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons LearnedPrivacy in AI/ML Systems: Practical Challenges and Lessons Learned
Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
 
IDOL presentation
IDOL presentationIDOL presentation
IDOL presentation
 
Analytics At Work T. Davenport
Analytics At Work T. DavenportAnalytics At Work T. Davenport
Analytics At Work T. Davenport
 
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
 
Information Management and Analytics
Information Management and Analytics Information Management and Analytics
Information Management and Analytics
 
Big Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The OpportunityBig Data, Analytics, and Social Good - The Challenges, The Opportunity
Big Data, Analytics, and Social Good - The Challenges, The Opportunity
 
Week2day2 communicating data for impact
Week2day2 communicating data for impactWeek2day2 communicating data for impact
Week2day2 communicating data for impact
 
Why does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docxWhy does telling a story with your data matters  Explain the impo.docx
Why does telling a story with your data matters  Explain the impo.docx
 
Big Data v2.pptx
Big Data v2.pptxBig Data v2.pptx
Big Data v2.pptx
 
Complete Guide to Data Quality
Complete Guide to Data QualityComplete Guide to Data Quality
Complete Guide to Data Quality
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
 
How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...How Data as a Service can make your campaigns more successful - Dun and Brads...
How Data as a Service can make your campaigns more successful - Dun and Brads...
 
Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018Great Learning Amit Sharma March 2018
Great Learning Amit Sharma March 2018
 
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data science
 
For the Love of Big Data
For the Love of Big DataFor the Love of Big Data
For the Love of Big Data
 
4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models4 Critical Requirements for Building Truly Intelligent AI Models
4 Critical Requirements for Building Truly Intelligent AI Models
 
HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016HPE IDOL Technical Overview - july 2016
HPE IDOL Technical Overview - july 2016
 
Healthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & AnalyticsHealthcare Best Practices in Data Warehousing & Analytics
Healthcare Best Practices in Data Warehousing & Analytics
 
Dark data
Dark dataDark data
Dark data
 
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)
 

Recently uploaded

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 

Adventures in Data Profiling

  • 1. Jim Harris, Blogger‐in‐Chief, www.ocdqblog.com. Digitally signed by Jim Harris (o=Obsessive-Compulsive Data Quality (OCDQ), email=jim.harris@ocdqblog.com, c=US). Date: 2010.03.04 10:55:20 -06'00'
  • 2. Jim Harris, Blogger‐in‐Chief, www.ocdqblog.com. E‐mail: jim.harris@ocdqblog.com. Twitter: twitter.com/ocdqblog. LinkedIn: linkedin.com/in/jimharris. Adventures in Data Profiling. Copyright © 2010, Jim Harris. All rights reserved.
  • 3. Let the Adventures Begin . . . This will be a vendor‐neutral presentation: focusing on general methodology of data profiling and common functionality of data profiling tools; discussing how a data profiling tool helps automate some of the work needed for preliminary data analysis; reviewing fictional data and results produced by a fictional data profiling tool to illustrate basic concepts.
  • 4. Understanding Your Data: Understanding your data is essential to using it effectively and improving its quality. You need a reality check for the perceptions and assumptions you have about the quality of your data. You need to prepare meaningful questions to ask your business analysts and subject matter experts. There is simply no substitute for data analysis.
  • 5. Profiling Your Data: Data profiling includes many types of analysis, such as: verifying that data matches the metadata that describes it; identifying representations for the absence of data; identifying potential default and invalid values; checking data formats for inconsistencies; assessing domain, structural, and relational integrity.
  • 6. Getting Your Data Freq On: Data profiling tools can help you by automating some of the grunt work needed to begin your data analysis. One of their basic features is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within your fields. Therefore, I like to refer to using a data profiling tool as: “Getting Your Data Freq On”.
  • 7. Let Me Count The Ways: NULL – record count of NULL values. Missing – record count of Missing values (i.e., non‐NULL absence of data). Actual – record count of Actual values (i.e., non‐NULL and non‐Missing). Completeness – percentage calculated as Actual divided by total record count. Cardinality – count of the number of distinct actual values. Uniqueness – percentage calculated as Cardinality divided by total record count. Distinctness – percentage calculated as Cardinality divided by Actual.
  • 8. You Uniquely Complete Me: Completeness and Uniqueness are useful in evaluating potential key fields, and especially a single primary key, which should be both 100% Complete and 100% Unique.
  • 9. It’s a Distinct Possibility: Distinctness can be useful in evaluating the potential for duplicate records. Less than 100% Distinct means some distinct actual values occur on more than one record.
  • 10. Gimme the lo down, Drill‐down
  • 11. Freq’ing Distribution of Values: Frequency distribution of values is useful for fields with a low cardinality. Extremely low cardinality might be an indication of default values.
  • 12. Reviewing the Top N List: Reviewing the Top N most frequently occurring values.
  • 15. . . . the Adventures Conclude: What can just your analysis of data tell you about it? Understand your data better by first looking at it from a starting point of blissful ignorance and curiosity. A tool can help automate some of the grunt work, but the actual data analysis cannot be automated. Your analytical goal is not to try to find answers, but to discover the right questions.
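
The counts and percentages defined on slide 7, the key-field check from slide 8, and the Top N list from slide 12 can be sketched in a few lines of Python. This is only an illustration, not the fictional tool used in the slides: the function names `profile_field` and `is_candidate_key`, the `MISSING_MARKERS` set, and the sample values are all assumptions made for the example. In particular, which strings count as "Missing" is a judgment you would make after looking at your own data.

```python
from collections import Counter

# Assumption: these upper-cased strings represent "Missing" values,
# i.e., a non-NULL absence of data. Adjust for your own data.
MISSING_MARKERS = {"", "N/A", "UNKNOWN"}


def profile_field(values, missing_markers=MISSING_MARKERS, top_n=3):
    """Summarize one field using the definitions from slide 7."""

    def is_missing(v):
        return str(v).strip().upper() in missing_markers

    total = len(values)
    null_count = sum(1 for v in values if v is None)
    missing_count = sum(1 for v in values if v is not None and is_missing(v))
    # Actual values are non-NULL and non-Missing.
    actual = [v for v in values if v is not None and not is_missing(v)]

    freq = Counter(actual)    # frequency distribution of actual values
    cardinality = len(freq)   # number of distinct actual values

    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "total": total,
        "null": null_count,
        "missing": missing_count,
        "actual": len(actual),
        "cardinality": cardinality,
        "completeness": pct(len(actual), total),        # Actual / total
        "uniqueness": pct(cardinality, total),          # Cardinality / total
        "distinctness": pct(cardinality, len(actual)),  # Cardinality / Actual
        "top_n": freq.most_common(top_n),
    }


def is_candidate_key(profile):
    """Slide 8: a single primary key should be 100% Complete and 100% Unique."""
    return profile["completeness"] == 100.0 and profile["uniqueness"] == 100.0


# Hypothetical sample field: 8 records, 1 NULL, 2 Missing, 5 Actual.
sample = ["A", "B", "A", None, "", "C", "N/A", "A"]
result = profile_field(sample)
print(result["completeness"], result["uniqueness"], result["distinctness"])
# prints: 62.5 37.5 60.0
print(result["top_n"][0])  # prints: ('A', 3)
```

Here Distinctness is below 100% because the value "A" occurs on more than one record, which is the situation slide 9 flags as a potential for duplicates. The numbers only raise the question. Deciding whether those repeats are legitimate still takes analysis.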