Data Quality - Dimensions & Tools
Yash Kumar
San Jose State University
CMPE 255, Data Mining
1
Why Data Quality Matters
2
Why DQ Tools
● Data Profiling: Analysis and summary of datasets to understand their structure, content, and quality.
● Monitoring: Continuously evaluate DQ over time via calculated metrics and generated reports on detected issues.
● Issue Detection: Identifying specific DQ problems such as missing values, wrong format, etc.
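A minimal sketch of what data profiling might look like for a single column, assuming missing values are encoded as `None` or an empty string. The function name `profile_column` and the sample data are hypothetical and are not taken from the surveyed tools.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, missing count, distinct count, mode."""
    # Assumption: None and "" both mean "missing" in this dataset.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


# Hypothetical sample column with two missing entries.
ages = [34, 29, None, 34, "", 41]
summary = profile_column(ages)
print(summary)
```

Running the same summary on a schedule and comparing results over time is one simple way to approach monitoring.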
5
DQ Dimensions & Metrics
7
DQ Dimensions & Metrics: Intrinsic
The intrinsic dimension is assessed by measuring internal attributes or characteristics
of the data itself. It also covers missing values and redundant cases.
● Correctness
● Duplication
● Trustworthiness
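As a rough illustration of two intrinsic measurements, the sketch below computes a duplication rate and a missing-value rate over a small table. The function names, the use of `None` for missing cells, and the sample rows are assumptions made for this example, not metrics defined by the paper.

```python
def duplication_rate(rows):
    """Fraction of rows that are exact repeats of another row."""
    if not rows:
        return 0.0
    unique_rows = {tuple(row) for row in rows}
    return 1 - len(unique_rows) / len(rows)


def missing_rate(rows):
    """Fraction of cells that are None (assumed missing-value marker)."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    return sum(cell is None for cell in cells) / len(cells)


# Hypothetical table: one repeated row, one missing cell.
rows = [("a", 1), ("b", 2), ("a", 1), ("c", None)]
dup = duplication_rate(rows)
miss = missing_rate(rows)
print(f"duplication={dup:.3f} missing={miss:.3f}")
```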
8
DQ Dimensions & Metrics: Contextual
The contextual dimension checks whether the data aligns with the needs and goals of the ML
project.
● Class Imbalance
● Completeness
● Comprehensiveness
● Unbiasedness
● Variety
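Class imbalance is one contextual metric that is easy to quantify. The sketch below uses the ratio of the largest class to the smallest class, which is one common convention and an assumption here; the function name `imbalance_ratio` and the labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the majority class count to the minority class count."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical label column: 8 "ham" examples and 2 "spam" examples.
labels = ["spam"] * 2 + ["ham"] * 8
ratio = imbalance_ratio(labels)
print(ratio)
```

A ratio of 1.0 means the observed classes are evenly represented; larger values indicate a more skewed dataset.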
9
DQ Dimensions & Metrics: Representational
The representational dimension assesses the formats and structures of data: whether the
data is represented concisely and consistently, and whether it is interpretable.
● Conformity
● Consistency
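Conformity can be approximated as the share of values that match an expected format. The sketch below assumes the expected format is an ISO-style `YYYY-MM-DD` date string; the pattern, the function name `conformity_rate`, and the sample values are illustrative assumptions.

```python
import re

# Assumed target format: four-digit year, two-digit month, two-digit day.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def conformity_rate(values, pattern):
    """Fraction of string values that fully match the expected pattern."""
    if not values:
        return 1.0
    matches = sum(1 for value in values if pattern.fullmatch(value))
    return matches / len(values)


# Hypothetical column mixing several date formats.
dates = ["2024-06-28", "28/06/2024", "2024-07-01", "July 1, 2024"]
rate = conformity_rate(dates, ISO_DATE)
print(rate)
```

This checks shape only; a value such as `2024-13-45` would still match, so a stricter check would also parse the date.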
10
DQ Dimensions & Metrics: Accessibility
The accessibility dimension evaluates how readily all or part of the data can be
obtained. Availability means users can use and share the data under appropriate
safety controls.
● Availability
11
Data Cascade
Dataset with bad DQ → Poor DQ Metrics → Negative Effects
12
Challenges in DQ
01 Lack of standardized metrics
Overlap between dimensions complicates evaluations
02 Constantly evolving metrics
Rapid developments in ML mean that DQ metrics need to be updated constantly
03 Large scale, multimodal data
Adds complexity to the system
04 Constraints
All projects have time and cost constraints
13
DQ Tools: Summary
14
DQ Tools: Trends
15
DQ Tools: Comparison
1 Overall Trend
● Become more complete over time
● Providing new functionalities
2 Data Profiling
● Most tools provide it
● Some focus entirely on it
● Some provide additional features
3 Data Integration
● Multiple sources to one data store
● Ataccama ONE handles large loads
4 Data Transformation
● Several tools provide it
● Some provide comprehensive series of transforms
5 Automation & Monitoring
● AI-embedded tools capable of facilitating and rerunning workflows
6 Adopted Metrics
● Most support 5 metrics
● Latest tools support more
7 User Interface
● Some simple, some with good UX & documentation
● Some provide low-code solutions
16
DQ Tools: Proposed Roadmap
● Background Understanding: Attain knowledge of DQ, existing limitations, and market expectations.
● Scope & Key Features: Define the scope, key features, and metrics to create the tool. Ask the right questions.
● Implementation: Robust & scalable implementation of tech stack, functions, connectors, and metrics.
● Documentation: Provide clear documentation and engage with community to encourage open-source culture.
● User Interface: Design an interactive UI with considerations for non-tech people.
17
DQ Tools: Workflow
18
My Insights
My understanding of DQ Tools and
Dimensions after reading the paper:
19
My Insights
On a serious note…
20
My Insights
Specific Scenarios vs. General Scenarios
22
Conclusion
● Data quality is paramount: it is the foundation of successful ML
projects.
● Data quality tools are very useful in a data-centric ML lifecycle.
● Current tools are capable, especially the latest ones, but there is
still room for improvement in AI integration, automation, and usability.
23
Acknowledgments
Special thanks to the authors of A Survey on Data Quality Dimensions and
Tools for Machine Learning.
Citation: Zhou, Y., Tu, F., Sha, K., Ding, J., & Chen, H. (2024)
Available at https://arxiv.org/pdf/2406.19614v1
24
Thank you!
25

Data Quality - Dimensions & Tools: A Survey.pptx
