Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)

EMIT BOCCONI 9/10/23
Gianluca Tarasconi, CDO, ipQuants AG gt@ipQuants.com

About the author
 Background in Engineering
 18 years in Bocconi research centers (CESPRI, ICRIOS…) as data architect;
 Consultant for EPO, USPTO, WIPO, OECD, World Bank, E&Y, MIUR, MIT, LSE…
 Publications record: https://scholar.google.it/citations?user=UQQgBXEAAAAJ
 Nowadays Chief Data Officer at ipQuants AG

About www.ipQuants.com
 Provides an AI-driven digital cockpit (Qthena) for patent attorneys;
 Statistics and powerful indicators on the different phases of the patenting
process, plus competitive intelligence on law firms.

CDO main tasks
1. Data Governance: Develop and enforce data policies, standards, and
procedures to ensure data integrity and compliance.
2. Data Strategy: Craft a strategic vision for data utilization, aligning it with
organizational goals.
3. Data Quality: Oversee data quality management, ensuring accurate and reliable
data for decision-making.
4. Data Privacy & Security: Protect sensitive data through robust security
measures and compliance with data privacy regulations.
5. Data Analytics: Drive data-driven decision-making by promoting data analytics
and insights across the organization.
6. Data Innovation: Foster a culture of innovation by exploring new data
technologies and opportunities.
7. Data Talent: Build and nurture a skilled data team to execute the data strategy
effectively.

Error-free data are a utopia
“Utopia is on the horizon. I move two steps
closer; it moves two steps further away. I walk
another ten steps and the horizon runs ten steps
further away. As much as I may walk, I'll never
reach it. So what's the point of utopia? The point
is this: to keep walking.”
Eduardo Galeano
Aiming for error-free data does not mean you will
get them, but it gives you a direction in data
management.

A few unicorns exist…
 A gold standard dataset is a meticulously curated benchmark
collection of data that serves as the definitive reference for evaluating the
accuracy and performance of algorithms, models, or systems. It represents
the highest attainable quality and correctness in data, making it a trusted
baseline for comparisons.
 Gold standard datasets are often painstakingly annotated or labeled by domain
experts to ensure their reliability.
 They are used in various fields, including machine learning, natural
language processing, and medical research, where precision and validation
are crucial. Researchers and practitioners use gold standard datasets to
measure and improve the quality and effectiveness of their data-driven
solutions.

But in everyday (data) life:
 Have in place processes for the continuous improvement of data quality;
 Avoid bias in data;
 Respect existing laws and regulations.
 European legislation on data processing is governed by Regulation (EU) 2016/679 (GDPR):
whoever processes personal data has the duty to verify that they are
lawful, relevant, up to date and correct.

What is a Bias in data?
 Bias in datasets refers to the presence of systematic and unfair
inaccuracies, favoring specific groups, perspectives, or outcomes, often
reflecting societal prejudices or limitations in data collection.
 This bias can manifest in various ways, including underrepresentation or
overrepresentation of certain demographics, cultural biases in language or
image data, or skewed sampling methods.
 Identifying and mitigating bias in datasets is crucial (know your enemy) for
ensuring fairness, transparency, and equity in data-driven applications and
decision systems. It requires careful data curation, diverse representation,
and ongoing evaluation to address these challenges effectively.
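
Under- or overrepresentation can be made measurable by comparing each group's share in a sample with its share in a reference population. The following is a minimal sketch; the function name, the group labels and the 50/50 reference split are hypothetical and chosen only for illustration.

```python
from collections import Counter


def representation_gap(sample_groups, population_shares):
    """Return (share in sample - share in population) for each group."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical data: 8 of 10 sampled people belong to group "A",
# while the assumed reference population is split 50/50.
sample = ["A"] * 8 + ["B"] * 2
gaps = representation_gap(sample, {"A": 0.5, "B": 0.5})
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A positive gap means the group is overrepresented in the sample, a negative gap means it is underrepresented.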

Data biases taxonomy
 from https://www.statice.ai/post/data-bias-types
 Selection bias
 Randomization is the process that balances out the effects of uncontrollable
factors - variables in a data set that are not specifically measured and can
compromise results. In data science, selection bias occurs when you have data
that aren’t properly randomized.
 Overgeneralization Bias
 When a person applies something from one event to all future events, it is
overgeneralization.
 See Russell's inductivist turkey

Data biases taxonomy (II)
 Reporting Biases
 A reporting bias is the inclusion of only a subset of results in an analysis, which
typically only covers a small fraction of evidence.
 As an example, a sentiment analysis model can be trained to predict whether a
book review on a popular website is positive or negative. The vast majority of
reviews in the training data set reflect extreme opinions (reviewers who either
adored or despised a book). This was because people were less likely to review a
book they did not feel strongly about. Because of this, the model is less likely to
accurately predict sentiment of reviews that use more subtle language to describe a
book.

Data biases taxonomy (III)
 Group Attribution Biases
 Group attribution biases refer to the human tendency to assume that an
individual's characteristics are always determined by the beliefs of the group.
The group attribution bias manifests itself when you give preference to your
own group (in-group bias) or when you stereotype members of groups you
don't belong to (out-group bias).
 For example, engineers might be predisposed to believe that applicants who
attended the same school as they did are better qualified for a job when
training a résumé-screening model for software developers.
 Implicit Biases
 Implicit biases occur when we make assumptions based on our personal
experiences. Implicit bias manifests itself as attitudes and stereotypes we hold
about others, even when we are unaware of it.

What is quality
“Quality...you know what it is, yet you don’t know what it is. But
that’s self-contradictory. But some things are better than others,
that is, they have more quality. But when you try to say what the
quality is, apart from the things that have it, it all goes poof! […]
If no one knows what it is, then for all practical purposes it
doesn’t exist at all. But for all practical purposes it really does
exist.”
Robert M. Pirsig
Zen and the Art of Motorcycle Maintenance

Data governance main principles
 Data Quality: Maintain data accuracy, consistency, and reliability by setting and enforcing data
quality standards.
 Data Ownership: Clearly define roles and responsibilities for data stewardship and ownership
throughout the organization.
 Data Compliance: Ensure adherence to data privacy laws, industry regulations, and internal
policies.
 Data Documentation: Document metadata (1), data lineage (2), and data definitions to enhance data
understanding and usability.
 Data Lifecycle Management: Define processes for data creation, storage, archiving, and
disposal in line with business needs.
 Data Auditing and Monitoring: Implement access controls and permissions. Establish
monitoring mechanisms to detect and address data issues, anomalies, and breaches.
 Continuous Improvement: Continuously assess and refine data governance processes to adapt
to changing business needs and technological advancements.
 Data Ethics: Consider ethical implications in data usage, ensuring that data-driven decisions
align with ethical standards.
(1) Metadata: the information that describes and explains data.
(2) Data lineage: the data origin, what happens to it, and where it moves over time.

Know your Enemy…
 Data profiling is the process of examining, analyzing, and creating useful
summaries of data
 Collecting descriptive statistics like min, max, count and sum.
 Collecting data types, length and recurring patterns.
 Tagging data with keywords, descriptions or categories.
 Performing data quality assessment, including the risk of performing joins on the data.
 Discovering metadata and assessing its accuracy.
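
The profiling steps above can be sketched for a single column with the standard library alone. This is a minimal illustration, not a replacement for a profiling package; the function name `profile_column` and the pattern encoding (letters become `A`, digits become `9`) are assumptions made for this example.

```python
import re
from collections import Counter


def profile_column(values):
    """Summarise a column of strings: counts, min/max, lengths and patterns."""
    present = [v for v in values if v not in (None, "")]
    # Map letters to 'A' and digits to '9' to expose recurring patterns.
    patterns = Counter(
        re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", v)) for v in present
    )
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "max_length": max((len(v) for v in present), default=0),
        "patterns": patterns.most_common(3),
    }


# Illustrative patent numbers, including one empty cell.
patent_numbers = ["US1234567A", "EP7890123B", "JP-4567890-C", "", "CN2345678U"]
summary = profile_column(patent_numbers)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The pattern counts make the one number written with dashes stand out from the three that share the same layout.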

Python libraries for data profiling
 Data profiling is the process of examining and analyzing data to gain insights into its
structure, quality, completeness, and other characteristics.
 ydata_profiling is a package for data profiling that automates and standardizes the
generation of detailed reports, complete with statistics and visualizations;
 Lux is a Python library that facilitates fast and easy data exploration by automating the
visualization and data analysis process. By simply printing out a dataframe in a Jupyter
notebook, Lux recommends a set of visualizations highlighting interesting trends and
patterns in the dataset.
 DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive
data detection easy. With a single command, the library automatically formats and loads
files into a DataFrame. When profiling the data, it identifies the
schema, statistics, entities (PII / NPI), and more. Data profiles can then be used in
downstream applications or reports.

Example of data profiling
 Data profiling using ydata_profiling

Data Quality Dimensions
• Accuracy
• Completeness
• Consistency
• Timeliness
• Relevance
• Validity
• Uniqueness

Accuracy
 Accuracy refers to the degree to which data
correctly represents the real-world entity or event it
is meant to describe. It involves minimizing errors
and discrepancies in data.
• Ensuring data values are correct and precise.
• Eliminating data entry mistakes and inaccuracies.
• Validating data against reliable sources to maintain
accuracy.
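
Validation against a reliable source can be sketched as a field-by-field comparison with a trusted reference. The example below is a minimal sketch; the function name, the field names and the contents of the "trusted" reference are hypothetical.

```python
def find_mismatches(records, reference, key, fields):
    """Compare records with a trusted reference and list differing fields."""
    mismatches = []
    for record in records:
        trusted = reference.get(record[key])
        if trusted is None:
            continue  # no reference entry: accuracy cannot be judged here
        for field in fields:
            if record.get(field) != trusted.get(field):
                mismatches.append(
                    (record[key], field, record.get(field), trusted.get(field))
                )
    return mismatches


# Hypothetical records and a hypothetical trusted source.
records = [
    {"patent_number": "US1234567A", "title": "Automated Widget"},
    {"patent_number": "EP7890123B", "title": "Self-Driving Car Tech"},
]
reference = {
    "US1234567A": {"title": "Automated Widget"},
    "EP7890123B": {"title": "Advanced Toothbrush"},
}
issues = find_mismatches(records, reference, "patent_number", ["title"])
for issue in issues:
    print(issue)
```

Records with no counterpart in the reference are skipped rather than flagged, since their accuracy cannot be assessed this way.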

Examples on patent data
 A patent dataset typically contains a comprehensive collection of information related to patents examined by a particular
patent office or jurisdiction. It serves as a resource for various purposes, including research, innovation, and intellectual
property management. Here's a brief description of the typical content found in a patent dataset:
 Patent Identification: This includes a unique patent identifier, such as a patent number, that distinguishes each patent
record within the dataset.
 Inventor Information: Details about the inventors, including their names, affiliations, and sometimes their contact
information.
 Patent Title: A concise title that describes the invention or technology covered by the patent.
 Abstract: A brief summary of the patent's contents, providing an overview of the patented innovation.
 Patent Claims: The patent claims section outlines the specific legal protections and rights granted by the patent. It defines
the boundaries of the invention.
 Filing and Grant Dates: The dates when the patent application was filed and when the patent was granted, providing a
timeline for the patent's life.
 Assignee Information: Details about the entity or entities to whom the patent rights have been assigned or who hold
ownership of the patent.
 Citations: Information about prior art or related patents that influenced or were cited by the patent in question. This can
provide insights into the innovation's context.
 Technology Classification: Patents are often categorized into specific technology classes or subclasses based on their
subject matter. These classifications help in organizing and searching for patents.
 Legal Status: Information about the current legal status of the patent, such as whether it is active, expired, or abandoned.

A patent document
 Patent Identification: publication number US D504889,
application number 29/201636.
 Inventor Information: Steve Jobs and others, with
address.
 Patent Title: ELECTRONIC DEVICE
 Patent Claims: Ornamental design of an electronic
device.
 Filing and Grant Dates: filed 17/3/2004, granted 10/5/2005.
 Citations: D 345346
 Assignee Information: Apple.
 Technology Classification: D 14/374 ….
 Legal Status: Granted

Are patent data Biased?
 Data collection Bias: data are collected for examiners' use, thus fields such as
addresses, names, etc. are not standardized, or not even collected
 Representation Bias: inventors are not representative of the whole
population
 Overgeneralization Bias: innovation proceeds with disruptive changes:
using patent data for predictions could have this bias.
 Know your Bias…

Inaccurate Patent Records
1. Patent Title: The patent titles do not accurately describe
the inventions. For example, the patent with the title
"Automated Widget" actually describes a smartphone,
and "Self-Driving Car Tech" is associated with an advanced
toothbrush.
2. Abstract: The abstracts are entirely inaccurate and do not
match the actual content of the patents. For instance, a
vaccine patent has an abstract about a revolutionary
vaccine but is titled "Improved Gadget."
3. Inventor Name: The names of inventors are present but
may not match the actual inventors of the described
inventions. For instance, an invention may be attributed to
an inventor who did not create it.
4. Patent Number: The patent numbers are correctly
formatted, but the information associated with them is
inaccurate, making it difficult to trust the dataset for
research or intellectual property purposes.
5. Filing Date: The filing dates appear to be accurate, but
they are not linked to the inventions' descriptions, further
contributing to data inaccuracy.
Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
US1234567A | John Smith | 15/05/2020 | Automated Widget | A new type of smartphone.
EP7890123B | Mary Johnson | 20/08/2019 | Self-Driving Car Tech | An advanced toothbrush design.
JP4567890C | Robert Brown | 10/06/2021 | Improved Gadget | Revolutionary vaccine for a rare disease.
CN2345678U | Jane Doe | 10/12/2021 | Mobile App Interface | A recipe for homemade cookies.
US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | High-speed internet router.

Completeness
 Completeness assesses whether all required data elements are
present in a dataset. Incomplete data can lead to biased or
incorrect analyses.
• Verifying that all necessary fields are populated.
• Handling missing data through imputation or data collection.
• Ensuring data records are not missing critical information.
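
Verifying that the necessary fields are populated can be sketched as a per-field fill rate. This is a minimal example; the function name, the field names and the sample rows are hypothetical.

```python
def completeness(records, required_fields):
    """Return, per required field, the share of records with a non-empty value."""
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in required_fields
    }


# Hypothetical rows with some empty cells.
rows = [
    {"patent_number": "US1234567A", "inventor": "John Smith", "filing_date": "15/05/2020"},
    {"patent_number": "JP4567890C", "inventor": "Robert Brown", "filing_date": ""},
    {"patent_number": "CN2345678U", "inventor": "", "filing_date": "10/12/2021"},
    {"patent_number": "", "inventor": "", "filing_date": ""},
]
fill_rates = completeness(rows, ["patent_number", "inventor", "filing_date"])
for field, rate in fill_rates.items():
    print(f"{field}: {rate:.0%}")
```

A fill rate only covers the records that are present: it cannot reveal records that are missing from the dataset altogether.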

Incomplete Patents dataset
US1234567A | John Smith | 15/05/2020 | Automated Widget
JP4567890C | Robert Brown | | Improved Gadget
CN2345678U | | 10/12/2021 | Mobile App Interface
| | | Smart Device Control
- Patent Title: Several patents are missing titles,
making it challenging to understand the subject
matter of these inventions.
- Abstract: None of the patents have abstracts,
which are typically brief summaries of the inventions.
Without abstracts, it's difficult to get a quick overview of
what each patent covers.
- Filing Date: The filing date is missing for one patent and
the grant date is missing for all of them, which is essential
for understanding the patent's lifecycle and history.
- Inventor Name: Some inventor names are missing,
making it unclear who the inventors of these patents are.
- Patent Number: While patent numbers are present,
they are incomplete, and they lack any standardized
format, which can make it challenging to cross-reference
or uniquely identify the patents.
Are there missing records? This is the most difficult error to spot.

Consistency
 Consistency involves maintaining uniformity and coherence
in data across different sources, formats, and time periods.
• Checking for consistency in data formats and units of
measurement.
• Resolving discrepancies in data values between sources.
• Establishing data standards to ensure consistency.
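
Checking for consistency in data formats can be sketched for dates: try a list of known formats, convert to ISO 8601, and flag values that more than one format reads differently. This is a minimal sketch; the list of candidate formats and the choice to prefer day-first readings are assumptions made for this example.

```python
from datetime import datetime

# Candidate formats, tried in order. Preferring day-first is an assumption.
FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y%m%d", "%d.%m.%Y", "%Y-%m-%d"]


def to_iso(raw):
    """Return (iso_date, ambiguous); (None, False) if no format matches."""
    first = None
    readings = set()
    for fmt in FORMATS:
        try:
            value = datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
        if first is None:
            first = value
        readings.add(value)
    if first is None:
        return None, False
    # Ambiguous when different formats yield different calendar dates.
    return first.isoformat(), len(readings) > 1


for raw in ["25/03/2022", "03/25/2022", "10/12/2021", "20201027", "2021"]:
    print(raw, "->", to_iso(raw))
```

Ambiguous values such as 10/12/2021 cannot be resolved from the string alone; they need a rule agreed per source, which is why establishing data standards matters.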

Inconsistent Patent Records
1. Patent Title: The patent titles lack consistency. For
example, similar inventions have different titles (e.g.,
"Automated Widget" vs. "Widget Automation" for the
same concept).
2. Abstract: The abstracts also vary in terms of detail and
clarity. Some are concise and informative, while others
provide vague or incomplete descriptions.
3. Inventor Name: One patent record has no inventor name
specified, and there is a mix of formats for inventor
names (full names vs. partial names).
4. Patent Number: The patent numbers are consistent in
format but may not follow a standardized naming
convention, which can hinder cross-referencing or
matching with external databases.
5. Filing Date: The filing dates are in different date formats,
with some missing filing dates altogether, making it
challenging to establish a clear timeline.
US1234567A | John Smith | 2020.15.03 | Automated Widget | An automated widget for various applications.
EP7890123B | Mary Johnson | | Widget Automation | Automated widget for a range of purposes.
JP4567890C | R Brown | 2021 | Improved Gadget | An enhancement for gadgets.
CN2345678U | | 10/12/2021 | Mobile App Interface | Mobile application interface.
US3456789B | Lee, Samantha | 03/25/2022 | Smart Device Control System | Control system for smart devices.

Timeliness
 Timeliness measures whether data is available when needed and up to date
for its intended purpose.
• Setting data refresh intervals to meet business needs.
• Monitoring data pipelines for delays or bottlenecks.
• Ensuring data aligns with the time frame of analyses or decisions.
• Synchronization of data sources.
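
A refresh-interval check can be sketched by measuring how old the most recent record is relative to a reference date. This is a minimal example; the function names, the one-year threshold and the reference date are hypothetical.

```python
from datetime import date


def staleness_days(record_dates, as_of):
    """Days between the most recent record and the reference date."""
    return (as_of - max(record_dates)).days


def is_stale(record_dates, as_of, max_age_days):
    """True when the newest record is older than the allowed age."""
    return staleness_days(record_dates, as_of) > max_age_days


# Hypothetical filing dates and an assumed reference date.
filing_dates = [date(2010, 8, 15), date(2012, 6, 20), date(2014, 12, 10)]
as_of = date(2023, 10, 9)
print(staleness_days(filing_dates, as_of), is_stale(filing_dates, as_of, 365))
```

The acceptable age depends on the use of the data; one year is only a placeholder here.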

Patent Records with Timeliness Issues
1. Filing Date: All the patents in the dataset have
relatively old filing dates, ranging from 2010 to
2015. This indicates that the dataset is not up to
date and may not include recent inventions or
innovations.
2. Patent Titles and Abstracts: The patent titles and
abstracts describe technologies and inventions
that were relevant and cutting-edge at the time of
filing, but they may not accurately represent the
current state of technology.
3. Applicants: Some names may have changed
(Google vs Alphabet); some applicants may have been
acquired (USR is now a division of UNICOM Global) or
closed; the patent may even have been sold to others.
Patent Number | Patent Applicant | Filing Date | Patent Title | Abstract
US1234567A | University of Columbia | 15/08/2010 | Automated Widget | An innovative widget design.
EP7890123B | US Robotics | 20/06/2012 | Advanced Robotics | Cutting-edge robotics technology.
JP4567890C | Disney | 10/09/2013 | Improved Gadget | A more efficient gadget solution.
CN2345678U | Google | 10/12/2014 | Mobile App Interface | User-friendly mobile app interface.
| | | Smart Device Control System | High-speed control system for devices.

Relevance
 Relevance assesses whether the data collected is
pertinent and useful for the intended analysis or
decision-making.
• Identifying and eliminating irrelevant data.
• Tailoring data collection efforts to align with specific
goals.
• Regularly reviewing data to ensure it remains relevant
over time.
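
Eliminating irrelevant data can be sketched as a filter tied to an explicit goal. The example below is a minimal sketch, assuming the goal is "recent stationery innovations"; the keyword list, the cut-off date and the sample records are hypothetical.

```python
from datetime import date


def is_relevant(record, keywords, earliest):
    """Keep records filed on or after `earliest` that mention a target keyword."""
    text = f"{record['title']} {record['abstract']}".lower()
    recent = record["filing_date"] >= earliest
    return recent and any(keyword in text for keyword in keywords)


# Hypothetical records, keywords and cut-off date.
records = [
    {"title": "Improved Paperclip", "abstract": "A new design for paperclips.",
     "filing_date": date(2015, 5, 15)},
    {"title": "Advanced Pencil Sharpener", "abstract": "Pencil sharpener with safety features.",
     "filing_date": date(2021, 6, 10)},
    {"title": "Innovative Coffee Mug", "abstract": "A coffee mug with a built-in flashlight.",
     "filing_date": date(2021, 12, 10)},
]
keywords = ["paperclip", "stapler", "pencil"]
relevant = [r["title"] for r in records if is_relevant(r, keywords, date(2020, 1, 1))]
print(relevant)
```

Keyword matching is a crude proxy for relevance; in patent data, technology classifications are usually a more reliable selector.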

Patent Records with Lack of Relevance
(context: looking for recent stationery innovations)
1. Patent Titles: The patent titles describe
inventions such as paperclips, staplers, pencil
sharpeners, coffee mugs, and smart device
control apps. Some of them are outside our
target.
2. Abstracts: The abstracts provide more
detailed information about these inventions,
but they still do not demonstrate a clear
connection to cutting-edge or industry-relevant
technologies.
3. Filing Dates: While some patents are relatively
recent (filed in 2021 or 2022), the inventions
from 2012 and 2015 are not relevant.
US1234567A | John Smith | 15/05/2015 | Improved Paperclip | A new design for paperclips.
| | | Enhanced Stapler | A stapler with improved ergonomics.
JP4567890C | Robert Brown | 10/06/2021 | Advanced Pencil Sharpener | Cutting-edge pencil sharpener with safety features.
CN2345678U | Jane Doe | 10/12/2021 | Innovative Coffee Mug | A coffee mug with a built-in flashlight.
| | | Smart Device Control | An app for controlling smart home devices.

Validity
 Validity ensures that data conforms to defined rules and
constraints, preventing incorrect or nonsensical values.
• Implementing data validation checks to flag invalid entries.
• Defining and enforcing data constraints.
• Regularly auditing data for validity against predefined criteria.
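
Validation checks can be sketched as explicit rules applied to each record. This is a minimal example; the patent number pattern (two-letter country code, seven digits, kind code) and the day-first date format are assumed rules for illustration, not an official numbering standard.

```python
import re
from datetime import datetime

# Assumed rule: 2-letter country code, 7 digits, kind code (letter + optional digit).
PATENT_NUMBER = re.compile(r"[A-Z]{2}\d{7}[A-Z]\d?")


def validate(record):
    """Return the names of the fields that break the assumed rules."""
    errors = []
    if not PATENT_NUMBER.fullmatch(record["patent_number"]):
        errors.append("patent_number")
    try:
        datetime.strptime(record["filing_date"], "%d/%m/%Y")
    except ValueError:
        errors.append("filing_date")
    return errors


# Illustrative records.
print(validate({"patent_number": "US3456789B1", "filing_date": "25/03/2022"}))
print(validate({"patent_number": "EP7890123", "filing_date": "20.08.2019"}))
print(validate({"patent_number": "JP-4567890-C", "filing_date": "10/28/2021"}))
```

Rules like these only check form: a well-formed number can still point to the wrong patent, which is an accuracy problem rather than a validity one.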

Patent Records with Lack of Validity
Patent Number | Inventor Name | Filing Date | Patent Title | References
US1234567A | John Smith | 20201027 | Time Travel Device | JP20011347A,
EP7890123 | Mary Johnson | 20.08.2019 | Perpetual Motion Machine | A perpetual motion machine, a concept debunked by physics.
JP-4567890-C | Robert Brown | 10/28/2021 | Anti-Gravity Shoes | [JP20011347A, US20167654320A]
CN2345678U | Jane Doe | 10/12/2021 | Immortality Elixir | .
US3456789B1 | Samantha Lee | 25/03/2022 | Mind-Reading Headset | J J Cale - Magnolia
1. Patent Number: the numbers follow different
standards, and some of them are
incomplete.
2. Filing Date: different date formats are
mixed.
3. References: these are inconsistent both in
format (comma-separated list of patents
vs. list within []) and in
content (title vs. number vs. author and
title…)

Uniqueness
 Uniqueness measures whether each data entry is distinct and not
duplicated within the dataset.
• Identifying and removing duplicate records.
• Implementing data deduplication processes.
• Ensuring data uniqueness to avoid overcounting or incorrect
analyses.
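
Identifying duplicates can be sketched by grouping rows on a candidate key and reporting every value that occurs more than once. This is a minimal example; the function name and the field names are hypothetical, and the sample values are illustrative.

```python
from collections import defaultdict


def find_duplicates(records, key):
    """Map each repeated key value to the 1-based row numbers where it occurs."""
    rows_by_value = defaultdict(list)
    for row_number, record in enumerate(records, start=1):
        rows_by_value[record[key]].append(row_number)
    return {value: rows for value, rows in rows_by_value.items() if len(rows) > 1}


# Illustrative rows: one repeated patent number and one repeated title.
records = [
    {"patent_number": "US1234567A", "title": "Improved Light Bulb"},
    {"patent_number": "US3456789B", "title": "Enhanced Umbrella"},
    {"patent_number": "JP4567890C", "title": "Improved Light Bulb"},
    {"patent_number": "CN2345678U", "title": "Redesigned Pen"},
    {"patent_number": "US3456789B", "title": "Smart Toaster"},
]
print(find_duplicates(records, "patent_number"))
print(find_duplicates(records, "title"))
```

Exact matching finds repeated keys but not the same entity written in different ways; that requires entity disambiguation.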

Patent Records with Lack of Uniqueness
1. Patent Titles: The patent titles may
not be unique (see lines 1 and 3; this is
an issue only if we use the title as a key field).
2. Patent Number: Lines 2 and 5 have the
same number but refer to totally
different patents.
3. Inventor: We cannot be sure whether the John
Smith of the patent in line 1 is the same person as
in line 5 (entity disambiguation needed).
US1234567A | John Smith | 15/05/2020 | Improved Light Bulb | An incremental improvement to existing light bulb tech.
US3456789B | Mary Johnson | 20/08/2019 | Enhanced Umbrella | An umbrella design with minor ergonomic enhancements.
JP4567890C | Robert Brown | 10/06/2021 | Improved Light Bulb | A spoon design featuring minor ergonomic changes.
CN2345678U | Jane Doe | 10/12/2021 | Redesigned Pen | A pen design with slight modifications.
US3456789B | John Smith | 25/03/2022 | Smart Toaster | A toaster with minor technological upgrades.

Data Documentation
 Metadata Standardization: Standardize metadata formats and definitions to
ensure consistency and clarity across data assets.
 Data Classification: Categorize data assets based on their type, sensitivity, and
purpose to aid in search and access.
 Data Descriptions: Provide detailed descriptions of each data asset, including its
source, usage, and relevance to the organization.
 Data Lineage: Document the data's journey from source to consumption, showing
how it's transformed and used.

Data Documentation (II)
 Keyword Tags: Attach relevant keywords and tags to data assets for improved
discoverability.
 Data Dependencies: Highlight relationships and dependencies between different
data assets to provide context.
 Versioning: Maintain version history for data assets to track changes and updates
over time.
 Data Ownership: Clearly indicate data ownership and stewardship responsibilities
for each asset.
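
The documentation items above can be captured in one machine-readable record per data asset. This is a minimal sketch; the class name, the field names and all values are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataAssetDoc:
    """Hypothetical documentation record for one data asset."""
    name: str
    description: str
    owner: str
    classification: str
    version: str
    tags: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)
    lineage: list = field(default_factory=list)


# All values below are made up for illustration.
doc = DataAssetDoc(
    name="patent_applications",
    description="One row per patent application.",
    owner="data-team@example.org",
    classification="public",
    version="1.0.0",
    tags=["patents"],
    depends_on=["raw_patent_dump"],
    lineage=["downloaded from source", "dates normalised to ISO 8601"],
)
doc_json = json.dumps(asdict(doc), indent=2)
print(doc_json)
```

Serializing the record to JSON keeps the documentation readable by both people and programs.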

Logical Model (E-R Diagram)
(relational DBs only)
 EPO PatStat database
 Table naming convention: TLS999 + content
 Main key across database: appln_id (patent application unique id)

Data Ontologies
 Data ontologies are structured frameworks that define and organize the
concepts, entities, and relationships within a specific domain of knowledge.
They play a crucial role in the field of data science and knowledge
management. Here are a few lines to explain their significance:
 "Data ontologies serve as the building blocks of knowledge representation in
data science. They provide a structured way to define and categorize the
various elements within a particular domain.
 By formalizing the relationships between entities and attributes, data
ontologies enable more effective data integration, semantic reasoning, and
data sharing among different systems and disciplines. They act as a common
language that bridges the gap between human understanding and machine
processing, facilitating better data interoperability and knowledge discovery."

FAIR data principles
 https://www.go-fair.org/fair-principles/
 Findability, Accessibility, Interoperability, and Reuse of digital assets
 Findable
The first step in (re)using data is to find them. Metadata and data should
be easy to find for both humans and computers. Machine-readable
metadata are essential for automatic discovery of datasets and services, so
this is an essential component of the FAIRification process.

FAIR data principles (II)
 Accessible
Once the user finds the required data, she/he/they need to know how they
can be accessed, possibly including authentication and authorisation.
 Interoperable
The data usually need to be integrated with other data. In addition, the
data need to interoperate with applications or workflows for analysis,
storage, and processing.
 Reusable
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this,
metadata and data should be well-described so that they can be replicated
and/or combined in different settings.

Conclusions
 “Each machine has its own, unique personality which probably could be
defined as the intuitive sum total of everything you know and feel about it.
This personality constantly changes, usually for the worse, but sometimes
surprisingly for the better, and it is this personality that is the real object of
motorcycle maintenance.”
Robert M. Pirsig
Zen and the Art of Motorcycle Maintenance
