Of Unicorns, Yetis, and
Error-Free Datasets
(or what is data quality?)
EMIT BOCCONI 9/10/23
Gianluca Tarasconi, CDO, ipQuants AG gt@ipQuants.com
About the author
• Background in Engineering
• 18 years in Bocconi research centers (CESPRI, ICRIOS…) as data architect
• Consultant for EPO, USPTO, WIPO, OECD, World Bank, E&Y, MIUR, MIT, LSE …
• Publications record: https://scholar.google.it/citations?user=UQQgBXEAAAAJ
• Currently Chief Data Officer at ipQuants AG
2
About www.ipQuants.com
• Provides an AI-driven digital cockpit (Qthena) for patent attorneys;
• Statistics and powerful indicators on the different phases of the patenting
process and on law firms' competitive intelligence.
3
CDO main tasks
1. Data Governance: Develop and enforce data policies, standards, and
procedures to ensure data integrity and compliance.
2. Data Strategy: Craft a strategic vision for data utilization, aligning it with
organizational goals.
3. Data Quality: Oversee data quality management, ensuring accurate and reliable
data for decision-making.
4. Data Privacy & Security: Protect sensitive data through robust security
measures and compliance with data privacy regulations.
5. Data Analytics: Drive data-driven decision-making by promoting data analytics
and insights across the organization.
6. Data Innovation: Foster a culture of innovation by exploring new data
technologies and opportunities.
7. Data Talent: Build and nurture a skilled data team to execute the data strategy
effectively.
4
Error-free data are a utopia
“Utopia is on the horizon. I move two steps
closer; it moves two steps further away. I walk
another ten steps and the horizon runs ten steps
further away. As much as I may walk, I'll never
reach it. So what's the point of utopia? The point
is this: to keep walking.”
Eduardo Galeano
Aiming for error-free data does not mean you will
get them, but it gives you a direction in data
management.
5
A few unicorns exist…
• A gold standard dataset is a meticulously curated benchmark
collection of data that serves as the definitive reference for evaluating the
accuracy and performance of algorithms, models, or systems. It represents
the highest attainable quality and correctness in data, making it a trusted
baseline for comparisons.
• Gold standard datasets are often painstakingly annotated or labeled by domain
experts to ensure their reliability.
• They are used in various fields, including machine learning, natural
language processing, and medical research, where precision and validation
are crucial. Researchers and practitioners use gold standard datasets to
measure and improve the quality and effectiveness of their data-driven
solutions.
6
But in everyday (data) life:
• Have processes in place for the continuous improvement of data quality;
• Avoid bias in data;
• Respect existing laws and regulations.
• Under European data-processing legislation (Regulation (EU) 2016/679, the GDPR),
whoever processes your personal data has the duty to verify that they are
lawful, relevant, up to date and correct.
7
What is bias in data?
• Bias in datasets refers to the presence of systematic and unfair
inaccuracies, favoring specific groups, perspectives, or outcomes, often
reflecting societal prejudices or limitations in data collection.
• This bias can manifest in various ways, including underrepresentation or
overrepresentation of certain demographics, cultural biases in language or
image data, or skewed sampling methods.
• Identifying and mitigating bias in datasets is crucial (know your enemy) for
ensuring fairness, transparency, and equity in data-driven applications and
decision systems. It requires careful data curation, diverse representation,
and ongoing evaluation to address these challenges effectively.
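Under- or overrepresentation can be checked by comparing the share of each group in the sample with its share in a reference population. The sketch below is only an illustration: the helper `representation_gap`, the group names, and all numbers are hypothetical.

```python
def representation_gap(sample_counts, population_shares):
    """Sample share minus reference share, per group.

    Positive values suggest overrepresentation, negative values
    underrepresentation, relative to the reference population.
    """
    total = sum(sample_counts.values())
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical data: counts observed in a dataset vs. assumed population shares.
sample = {"group_a": 80, "group_b": 20}
population = {"group_a": 0.5, "group_b": 0.5}

gaps = representation_gap(sample, population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

The choice of reference population is itself a judgment call, so a gap of this kind is a prompt for investigation rather than proof of bias.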
8
Data biases taxonomy
• From https://www.statice.ai/post/data-bias-types
• Selection bias
  - Randomization is the process that balances out the effects of uncontrollable
factors - variables in a data set that are not specifically measured and can
compromise results. In data science, selection bias occurs when you have data
that aren’t properly randomized.
• Overgeneralization bias
  - When a person applies something from one event to all future events, it is
overgeneralization.
  - See Russell's inductivist turkey.
9
Data biases taxonomy (II)
• Reporting biases
  - A reporting bias is the inclusion of only a subset of results in an analysis, which
typically only covers a small fraction of evidence.
  - As an example, a sentiment analysis model can be trained to predict whether a
book review on a popular website is positive or negative. The vast majority of
reviews in the training data set reflect extreme opinions (reviewers who either
adored or despised a book). This is because people are less likely to review a
book they do not feel strongly about. Because of this, the model is less likely to
accurately predict sentiment of reviews that use more subtle language to describe a
book.
10
Data biases taxonomy (III)
• Group attribution biases
  - Group attribution biases refer to the human tendency to assume that an
individual's characteristics are always determined by the beliefs of the group.
The group attribution bias manifests itself when you give preference to your
own group (in-group bias) or when you stereotype members of groups you
don't belong to (out-group bias).
  - For example, engineers might be predisposed to believe that applicants who
attended the same school as they did are better qualified for a job when
training a résumé-screening model for software developers.
• Implicit biases
  - Implicit biases occur when we make assumptions based on our personal
experiences. Implicit bias manifests itself as attitudes and stereotypes we hold
about others, even when we are unaware of it.
11
What is quality
“Quality...you know what it is, yet you don’t know what it is. But
that’s self-contradictory. But some things are better than others,
that is, they have more quality. But when you try to say what the
quality is, apart from the things that have it, it all goes poof! […]
If no one knows what it is, then for all practical purposes it
doesn’t exist at all. But for all practical purposes it really does
exist.”
Robert M. Pirsig
Zen and the Art of Motorcycle Maintenance
12
Data governance main principles
• Data Quality: Maintain data accuracy, consistency, and reliability by setting and enforcing data
quality standards.
• Data Ownership: Clearly define roles and responsibilities for data stewardship and ownership
throughout the organization.
• Data Compliance: Ensure adherence to data privacy laws, industry regulations, and internal
policies.
• Data Documentation: Document metadata (1), data lineage (2), and data definitions to enhance data
understanding and usability.
• Data Lifecycle Management: Define processes for data creation, storage, archiving, and
disposal in line with business needs.
• Data Auditing and Monitoring: Implement access controls and permissions. Establish
monitoring mechanisms to detect and address data issues, anomalies, and breaches.
• Continuous Improvement: Continuously assess and refine data governance processes to adapt
to changing business needs and technological advancements.
• Data Ethics: Consider ethical implications in data usage, ensuring that data-driven decisions
align with ethical standards.
(1) Metadata: the information that describes and explains data.
(2) Data lineage: the data origin, what happens to it, and where it moves over time.
13
Know your Enemy…
• Data profiling is the process of examining, analyzing, and creating useful
summaries of data:
  - Collecting descriptive statistics like min, max, count and sum.
  - Collecting data types, length and recurring patterns.
  - Tagging data with keywords, descriptions or categories.
  - Performing data quality assessment, including the risk of performing joins on the data.
  - Discovering metadata and assessing its accuracy.
14
Python libraries for data profiling
• Data profiling is the process of examining and analyzing data to gain insights into its
structure, quality, completeness, and other characteristics.
• ydata_profiling is a package for data profiling that automates and standardizes the
generation of detailed reports, complete with statistics and visualizations.
• Lux is a Python library that facilitates fast and easy data exploration by automating the
visualization and data analysis process. When a dataframe is printed in a Jupyter
notebook, Lux recommends a set of visualizations highlighting interesting trends and
patterns in the dataset.
• DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive
data detection easy. It loads data with a single command, automatically formatting and
loading files into a DataFrame. When profiling the data, it identifies the
schema, statistics, entities (PII / NPI), and more. Data profiles can then be used in
downstream applications or reports.
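Before reaching for one of these libraries, the basic profiling steps (counts, missing values, lengths, recurring patterns) can be sketched with the Python standard library alone. This is a minimal illustration and not how any of the libraries above work internally; the helpers `pattern_of` and `profile_column` and the sample values are hypothetical.

```python
import re
from collections import Counter

def pattern_of(value: str) -> str:
    """Reduce a value to a coarse shape: letters become 'A', digits become '9'."""
    shape = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"\d", "9", shape)

def profile_column(values):
    """Return simple descriptive statistics for one column of strings.

    Empty strings and None are counted as missing.
    """
    present = [v for v in values if v not in (None, "")]
    lengths = [len(v) for v in present]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "patterns": Counter(pattern_of(v) for v in present).most_common(),
    }

# Hypothetical sample: a column of patent numbers with one gap.
numbers = ["US1234567A", "EP7890123B", "", "JP-4567890-C"]
report = profile_column(numbers)
print(report)
```

A pattern that occurs only once or twice in a large column is often the quickest hint that some values follow a different standard.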
15
Example of data profiling
• Data profiling using ydata_profiling (figure: generated profiling report)
16
Data Quality Dimensions
• Accuracy
• Completeness
• Consistency
• Timeliness
• Relevance
• Validity
• Uniqueness
17
Accuracy
• Accuracy refers to the degree to which data
correctly represents the real-world entity or event it
is meant to describe. It involves minimizing errors
and discrepancies in data.
• Ensuring data values are correct and precise.
• Eliminating data entry mistakes and inaccuracies.
• Validating data against reliable sources to maintain
accuracy.
18
Examples on patent data
• A patent dataset typically contains a comprehensive collection of information related to patents examined by a particular
patent office or jurisdiction. It serves as a resource for various purposes, including research, innovation, and intellectual
property management. Here's a brief description of the typical content found in a patent dataset:
• Patent Identification: This includes a unique patent identifier, such as a patent number, that distinguishes each patent
record within the dataset.
• Inventor Information: Details about the inventors, including their names, affiliations, and sometimes their contact
information.
• Patent Title: A concise title that describes the invention or technology covered by the patent.
• Abstract: A brief summary of the patent's contents, providing an overview of the patented innovation.
• Patent Claims: The patent claims section outlines the specific legal protections and rights granted by the patent. It defines
the boundaries of the invention.
• Filing and Grant Dates: The dates when the patent application was filed and when the patent was granted, providing a
timeline for the patent's life.
• Assignee Information: Details about the entity or entities to whom the patent rights have been assigned or who hold
ownership of the patent.
• Citations: Information about prior art or related patents that influenced or were cited by the patent in question. This can
provide insights into the innovation's context.
• Technology Classification: Patents are often categorized into specific technology classes or subclasses based on their
subject matter. These classifications help in organizing and searching for patents.
• Legal Status: Information about the current legal status of the patent, such as whether it is active, expired, or abandoned.
19
A patent document
• Patent Identification: Publication number US D504889; application number 29/201636.
• Inventor Information: Steve Jobs and others, with address.
• Patent Title: ELECTRONIC DEVICE
• Patent Claims: Ornamental design of an electronic device.
• Filing and Grant Dates: filed 17/3/2004, granted 10/5/2005.
• Citations: D 345346
• Assignee Information: Apple.
• Technology Classification: D 14/374 ….
• Legal Status: Granted
20
Are patent data biased?
• Data collection bias: data are collected for examiners' use, so fields such as
addresses and names are not standardized, or not even collected.
• Representation bias: inventors are not representative of the whole
population.
• Overgeneralization bias: innovation proceeds through disruptive changes, so
predictions based on patent data could suffer from this bias.
• Know your bias…
21
Inaccurate Patent Records
1. Patent Title: The patent titles do not accurately describe
the inventions. For example, the patent with the title
"Automated Widget" actually describes a smartphone,
and "Self-Driving Car Tech" is associated with an advanced
toothbrush.
2. Abstract: The abstracts are entirely inaccurate and do not
match the actual content of the patents. For instance, a
vaccine patent has an abstract about a revolutionary
vaccine but is titled "Improved Gadget."
3. Inventor Name: The names of inventors are present but
may not match the actual inventors of the described
inventions. For instance, an inventor may be attributed to
an invention they did not create.
4. Patent Number: The patent numbers are correctly
formatted, but the information associated with them is
inaccurate, making it difficult to trust the dataset for
research or intellectual property purposes.
5. Filing Date: The filing dates appear to be accurate, but
they are not linked to the inventions' descriptions, further
contributing to data inaccuracy.
22
Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
US1234567A | John Smith | 15/05/2020 | Automated Widget | A new type of smartphone.
EP7890123B | Mary Johnson | 20/08/2019 | Self-Driving Car Tech | An advanced toothbrush design.
JP4567890C | Robert Brown | 10/06/2021 | Improved Gadget | Revolutionary vaccine for a rare disease.
CN2345678U | Jane Doe | 10/12/2021 | Mobile App Interface | A recipe for homemade cookies.
US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | High-speed internet router.
Completeness
• Completeness assesses whether all required data elements are
present in a dataset. Incomplete data can lead to biased or
incorrect analyses.
• Verifying that all necessary fields are populated.
• Handling missing data through imputation or data collection.
• Ensuring data records are not missing critical information.
23
Incomplete Patents dataset
Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
US1234567A | John Smith | 15/05/2020 | Automated Widget | (empty)
EP7890123B | Mary Johnson | 20/08/2019 | (empty) | (empty)
JP4567890C | Robert Brown | (empty) | Improved Gadget | (empty)
CN2345678U | (empty) | 10/12/2021 | Mobile App Interface | (empty)
US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | (empty)
24
- Patent Title: Several patents are missing titles,
making it challenging to understand the subject
matter of these inventions.
- Abstract: None of the patents have abstracts,
which are typically brief summaries of the inventions.
Without abstracts, it's difficult to get a quick overview of
what each patent covers.
- Filing Date: The filing date is missing for one patent and
the grant date is missing for all of them; both are essential
for understanding the patent's lifecycle and history.
- Inventor Name: Some inventor names are missing,
making it unclear who the inventors of these patents are.
- Patent Number: While patent numbers are present,
they are incomplete, and they lack any standardized
format, which can make it challenging to cross-reference
or uniquely identify the patents.
Are there missing records?
This is the most difficult error
to spot.
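Field-level completeness can be measured as the share of non-empty values per column. The sketch below uses the incomplete sample records from this slide; the field names and the `completeness` helper are hypothetical.

```python
FIELDS = ["patent_number", "inventor_name", "filing_date", "patent_title", "abstract"]

# The five incomplete sample records; empty strings mark missing values.
raw_rows = [
    ("US1234567A", "John Smith", "15/05/2020", "Automated Widget", ""),
    ("EP7890123B", "Mary Johnson", "20/08/2019", "", ""),
    ("JP4567890C", "Robert Brown", "", "Improved Gadget", ""),
    ("CN2345678U", "", "10/12/2021", "Mobile App Interface", ""),
    ("US3456789B", "Samantha Lee", "25/03/2022", "Smart Device Control", ""),
]
records = [dict(zip(FIELDS, row)) for row in raw_rows]

def completeness(rows, fields):
    """Share of non-empty values per field, between 0.0 and 1.0."""
    total = len(rows)
    return {f: sum(1 for r in rows if r.get(f)) / total for f in fields}

scores = completeness(records, FIELDS)
for field, score in scores.items():
    print(f"{field}: {score:.0%}")
```

A check like this only sees the records that are present. Detecting missing records requires an external reference, for example an expected record count from the source office.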
Consistency
• Consistency involves maintaining uniformity and coherence
in data across different sources, formats, and time periods.
• Checking for consistency in data formats and units of
measurement.
• Resolving discrepancies in data values between sources.
• Establishing data standards to ensure consistency.
25
Inconsistent Patent Records
1. Patent Title: The patent titles lack consistency. For
example, similar inventions have different titles (e.g.,
"Automated Widget" vs. "Widget Automation" for the
same concept).
2. Abstract: The abstracts also vary in terms of detail and
clarity. Some are concise and informative, while others
provide vague or incomplete descriptions.
3. Inventor Name: One patent record has no inventor name
specified, and there is a mix of formats for inventor
names (full names vs. partial names).
4. Patent Number: The patent numbers are consistent in
format but may not follow a standardized naming
convention, which can hinder cross-referencing or
matching with external databases.
5. Filing Date: The filing dates are in different date formats,
with some missing filing dates altogether, making it
challenging to establish a clear timeline.
26
Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
US1234567A | John Smith | 2020.15.03 | Automated Widget | An automated widget for various applications.
EP7890123B | Mary Johnson | (empty) | Widget Automation | Automated widget for a range of purposes.
JP4567890C | R Brown | 2021 | Improved Gadget | An enhancement for gadgets.
CN2345678U | (empty) | 10/12/2021 | Mobile App Interface | Mobile application interface.
US3456789B | Lee, Samantha | 03/25/2022 | Smart Device Control System | Control system for smart devices.
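One way to restore consistency in the filing dates is to try a list of known formats and convert every match to ISO 8601. This is a minimal sketch under stated assumptions: the format list and its order are assumptions (day-first is tried before month-first, so "10/12/2021" is read as 10 December), and `normalize_date` is a hypothetical helper.

```python
from datetime import datetime

# Candidate formats, tried in order. Trying day-first before month-first
# is an assumption; ambiguous values such as "10/12/2021" depend on it.
FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y.%d.%m"]

def normalize_date(raw):
    """Return an ISO 8601 date string, or None when no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

# Filing dates from the inconsistent sample records (one is empty).
raw_dates = ["2020.15.03", "", "2021", "10/12/2021", "03/25/2022"]
normalized = [normalize_date(d) for d in raw_dates]
print(normalized)
```

Values that cannot be parsed, such as the year-only "2021", are returned as None rather than guessed, so they can be reviewed by hand.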
Timeliness
• Timeliness measures whether data is available when needed and up to date
for its intended purpose.
• Setting data refresh intervals to meet business needs.
• Monitoring data pipelines for delays or bottlenecks.
• Ensuring data aligns with the time frame of analyses or decisions.
• Synchronization of data sources.
27
Patent Records with Timeliness Issues
1. Filing Date: All the patents in the dataset have
relatively old filing dates, ranging from 2010 to
2015. This indicates that the dataset is not up to
date and may not include recent inventions or
innovations.
2. Patent Titles and Abstracts: The patent titles and
abstracts describe technologies and inventions
that were relevant and cutting-edge at the time of
filing, but they may not accurately represent the
current state of technology.
3. Applicants: Some applicants may have changed name
(Google vs. Alphabet), been acquired (USR is now a
division of UNICOM Global) or closed; the patent
may even have been sold to others.
28
Patent Number | Patent Applicant | Filing Date | Patent Title | Abstract
US1234567A | University of Columbia | 15/08/2010 | Automated Widget | An innovative widget design.
EP7890123B | US Robotics | 20/06/2012 | Advanced Robotics | Cutting-edge robotics technology.
JP4567890C | Disney | 10/09/2013 | Improved Gadget | A more efficient gadget solution.
CN2345678U | Google | 10/12/2014 | Mobile App Interface | User-friendly mobile app interface.
US3456789B | Samantha Lee | 25/03/2015 | Smart Device Control System | High-speed control system for devices.
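A simple timeliness check compares the most recent filing date in the dataset with a reference day. In the sketch below the reference day (9 October 2023, read from the talk date on the title slide) and the two-year threshold are assumptions, and `years_since_latest` is a hypothetical helper.

```python
from datetime import date, datetime

# Filing dates from the sample records above (day/month/year).
filing_dates = ["15/08/2010", "20/06/2012", "10/09/2013", "10/12/2014", "25/03/2015"]

def years_since_latest(dates, today):
    """Return the latest date and its age in years relative to `today`."""
    latest = max(datetime.strptime(d, "%d/%m/%Y").date() for d in dates)
    return latest, (today - latest).days / 365.25

reference_day = date(2023, 10, 9)   # assumed reference day
max_age_years = 2                   # hypothetical freshness threshold

latest, age_years = years_since_latest(filing_dates, reference_day)
is_stale = age_years > max_age_years
print(latest, round(age_years, 1), is_stale)
```

The right threshold depends on the intended use: a dataset refreshed yearly may be perfectly timely for historical research and useless for monitoring competitors.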
Relevance
• Relevance assesses whether the data collected is
pertinent and useful for the intended analysis or
decision-making.
• Identifying and eliminating irrelevant data.
• Tailoring data collection efforts to align with specific
goals.
• Regularly reviewing data to ensure it remains relevant
over time.
29
Patent Records with Lack of Relevance
(context: looking for recent stationery innovations)
1. Patent Titles: The patent titles describe
inventions such as paperclips, staplers, pencil
sharpeners, coffee mugs, and smart device
control apps. Some of them are outside our
target.
2. Abstracts: The abstracts provide more
detailed information about these inventions,
but they still do not demonstrate a clear
connection to cutting-edge or
industry-relevant technologies.
3. Filing Dates: While some patents are relatively
recent (filed in 2021 or 2022), the inventions
from 2012 and 2015 are not relevant.
30
Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
US1234567A | John Smith | 15/05/2015 | Improved Paperclip | A new design for paperclips.
EP7890123B | Mary Johnson | 20/08/2012 | Enhanced Stapler | A stapler with improved ergonomics.
JP4567890C | Robert Brown | 10/06/2021 | Advanced Pencil Sharpener | Cutting-edge pencil sharpener with safety features.
CN2345678U | Jane Doe | 10/12/2021 | Innovative Coffee Mug | A coffee mug with a built-in flashlight.
US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | An app for controlling smart home devices.
Validity
• Validity ensures that data conforms to defined rules and
constraints, preventing incorrect or nonsensical values.
• Implementing data validation checks to flag invalid entries.
• Defining and enforcing data constraints.
• Regularly auditing data for validity against predefined criteria.
31
Patent Records with Lack of Validity
Patent Number | Inventor Name | Filing Date | Patent Title | References
US1234567A | John Smith | 20201027 | Time Travel Device | JP20011347A,
EP7890123 | Mary Johnson | 20.08.2019 | Perpetual Motion Machine | A perpetual motion machine, a concept debunked by physics.
JP-4567890-C | Robert Brown | 10/28/2021 | Anti-Gravity Shoes | [JP20011347A, US20167654320A]
CN2345678U | Jane Doe | 10/12/2021 | Immortality Elixir | .
US3456789B1 | Samantha Lee | 25/03/2022 | Mind-Reading Headset | J J Cale - Magnolia
32
1. Patent Number: the numbers follow different
standards, and some of them are
incomplete.
2. Filing Date: several different formats are
used.
3. References are inconsistent, both in
format (comma-separated list of patents
or list within []) and in
content (title vs. number vs. author and
title…).
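A validity rule for the patent number can be expressed as a regular expression. The target format below (two-letter office code, digits, then a kind code made of one letter and an optional digit) is an assumption chosen for this illustration, not an official standard; `is_valid_number` is a hypothetical helper.

```python
import re

# Assumed target format: 2-letter office code, digits, then a kind code
# (one letter, optionally followed by one digit), with no separators.
PATENT_NUMBER = re.compile(r"[A-Z]{2}\d+[A-Z]\d?")

def is_valid_number(value):
    """True when the whole value matches the assumed format."""
    return PATENT_NUMBER.fullmatch(value) is not None

# Patent numbers from the sample records above.
numbers = ["US1234567A", "EP7890123", "JP-4567890-C", "CN2345678U", "US3456789B1"]
invalid = [n for n in numbers if not is_valid_number(n)]
print(invalid)
```

Under this rule the number without a kind code and the one with hyphens are flagged. A format check says nothing about whether the number refers to a real patent; that is an accuracy question.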
Uniqueness
• Uniqueness measures whether each data entry is distinct and not
duplicated within the dataset.
• Identifying and removing duplicate records.
• Implementing data deduplication processes.
• Ensuring data uniqueness to avoid overcounting or incorrect
analyses.
33
Patent Records with Lack of
Uniqueness
1. Patent Titles: Patent titles may not be
unique (see lines 1 and 3); this is an
issue only if the title is used as a key field.
2. Patent Number: Lines 2 and 5 have the
same number but refer to totally
different patents.
3. Inventor: We cannot be sure whether the John
Smith of the patent in line 1 is the same as the one in
line 5 (entity disambiguation needed).
34
Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
US1234567A | John Smith | 15/05/2020 | Improved Light Bulb | An incremental improvement to existing light bulb tech.
US3456789B | Mary Johnson | 20/08/2019 | Enhanced Umbrella | An umbrella design with minor ergonomic enhancements.
JP4567890C | Robert Brown | 10/06/2021 | Improved Light Bulb | A spoon design featuring minor ergonomic changes.
CN2345678U | Jane Doe | 10/12/2021 | Redesigned Pen | A pen design with slight modifications.
US3456789B | John Smith | 25/03/2022 | Smart Toaster | A toaster with minor technological upgrades.
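Repeated values in a candidate key field can be found by counting occurrences. The sketch below applies this to three columns of the sample records; the `duplicates` helper is hypothetical.

```python
from collections import Counter

# (patent number, inventor name, patent title) from the sample records above.
rows = [
    ("US1234567A", "John Smith", "Improved Light Bulb"),
    ("US3456789B", "Mary Johnson", "Enhanced Umbrella"),
    ("JP4567890C", "Robert Brown", "Improved Light Bulb"),
    ("CN2345678U", "Jane Doe", "Redesigned Pen"),
    ("US3456789B", "John Smith", "Smart Toaster"),
]

def duplicates(values):
    """Values that occur more than once, in order of first appearance."""
    counts = Counter(values)
    return [v for v in counts if counts[v] > 1]

dup_numbers = duplicates(r[0] for r in rows)
dup_inventors = duplicates(r[1] for r in rows)
dup_titles = duplicates(r[2] for r in rows)
print(dup_numbers, dup_inventors, dup_titles)
```

A repeated patent number is an error in a key field, while a repeated inventor name is only a candidate for entity disambiguation: exact string matching cannot tell two people called John Smith apart.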
Data Documentation
• Metadata Standardization: Standardize metadata formats and definitions to
ensure consistency and clarity across data assets.
• Data Classification: Categorize data assets based on their type, sensitivity, and
purpose to aid in search and access.
• Data Descriptions: Provide detailed descriptions of each data asset, including its
source, usage, and relevance to the organization.
• Data Lineage: Document the data's journey from source to consumption, showing
how it's transformed and used.
35
Data Documentation (II)
• Keyword Tags: Attach relevant keywords and tags to data assets for improved
discoverability.
• Data Dependencies: Highlight relationships and dependencies between different
data assets to provide context.
• Versioning: Maintain version history for data assets to track changes and updates
over time.
• Data Ownership: Clearly indicate data ownership and stewardship responsibilities
for each asset.
36
Logical Model (E-R Diagram)
(relational DBs only)
• EPO PatStat database
• Tables naming convention: TLS999 + content
• Main key across database: appln_id (patent application unique id)
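The role of appln_id as the key linking tables can be sketched with an in-memory SQLite database. The two tables below follow the TLS999 + content naming convention and are modelled on PATSTAT, but they are reduced to a toy schema with invented rows; the official PATSTAT data catalog remains the reference for actual table and column definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema: one table of applications, one linking persons to applications.
conn.executescript("""
    CREATE TABLE tls201_appln (
        appln_id INTEGER PRIMARY KEY,
        appln_filing_date TEXT
    );
    CREATE TABLE tls207_pers_appln (
        person_id INTEGER,
        appln_id INTEGER REFERENCES tls201_appln (appln_id)
    );
    INSERT INTO tls201_appln VALUES (1, '2020-05-15'), (2, '2019-08-20');
    INSERT INTO tls207_pers_appln VALUES (10, 1), (11, 1), (12, 2);
""")

# Count the persons attached to each application by joining on appln_id.
rows = conn.execute("""
    SELECT a.appln_id, COUNT(p.person_id) AS n_persons
    FROM tls201_appln AS a
    JOIN tls207_pers_appln AS p ON p.appln_id = a.appln_id
    GROUP BY a.appln_id
    ORDER BY a.appln_id
""").fetchall()
conn.close()
print(rows)
```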
37
Data Ontologies
• Data ontologies are structured frameworks that define and organize the
concepts, entities, and relationships within a specific domain of knowledge.
They play a crucial role in data science and knowledge
management.
• Data ontologies serve as the building blocks of knowledge representation in
data science. They provide a structured way to define and categorize the
various elements within a particular domain.
• By formalizing the relationships between entities and attributes, data
ontologies enable more effective data integration, semantic reasoning, and
data sharing among different systems and disciplines. They act as a common
language that bridges the gap between human understanding and machine
processing, facilitating better data interoperability and knowledge discovery.
38
FAIR data principles
• https://www.go-fair.org/fair-principles/
• Findability, Accessibility, Interoperability, and Reuse of digital assets
• Findable
The first step in (re)using data is to find them. Metadata and data should
be easy to find for both humans and computers. Machine-readable
metadata are essential for automatic discovery of datasets and services, so
this is an essential component of the FAIRification process.
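Machine-readable metadata can be as simple as a structured record serialized to JSON. The sketch below is only an illustration: every key and value is hypothetical and does not follow any particular metadata standard.

```python
import json

# Hypothetical metadata record; the keys are illustrative, not a standard.
metadata = {
    "identifier": "example-dataset-001",
    "title": "Sample patent records",
    "description": "Toy patent dataset used to illustrate data quality issues.",
    "keywords": ["patents", "data quality"],
    "license": "CC-BY-4.0",
    "format": "text/csv",
}

# Serialize so that both humans and programs can read the description.
machine_readable = json.dumps(metadata, indent=2, sort_keys=True)
print(machine_readable)
```

In practice a community metadata schema would replace these ad hoc keys, so that search services can index the record without custom code.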
39
FAIR data principles (II)
• Accessible
Once the user finds the required data, she/he/they need to know how they
can be accessed, possibly including authentication and authorisation.
• Interoperable
The data usually need to be integrated with other data. In addition, the
data need to interoperate with applications or workflows for analysis,
storage, and processing.
• Reusable
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this,
metadata and data should be well-described so that they can be replicated
and/or combined in different settings.
40
FAIRification process
41
Conclusions
“Each machine has its own, unique personality which probably could be
defined as the intuitive sum total of everything you know and feel about it.
This personality constantly changes, usually for the worse, but sometimes
surprisingly for the better, and it is this personality that is the real object of
motorcycle maintenance.”
Robert M. Pirsig
Zen and the Art of Motorcycle Maintenance
42
43

More Related Content

Similar to Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)

Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And FootballAmanda Gray
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfDr. Radhey Shyam
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodKarry Lu
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data scienceJordan Engbers
 
The Role of Community-Driven Data Curation for Enterprises
The Role of Community-Driven Data Curation for EnterprisesThe Role of Community-Driven Data Curation for Enterprises
The Role of Community-Driven Data Curation for EnterprisesEdward Curry
 
Research Method EMBA chapter 12
Research Method EMBA chapter 12Research Method EMBA chapter 12
Research Method EMBA chapter 12Mazhar Poohlah
 
Introduction to Data Analysis Course Notes.pdf
Introduction to Data Analysis Course Notes.pdfIntroduction to Data Analysis Course Notes.pdf
Introduction to Data Analysis Course Notes.pdfGraceOkeke3
 
data minig for eng with all topics and history
data minig for eng with all topics and historydata minig for eng with all topics and history
data minig for eng with all topics and historynbaisane16
 
Data science and data analytics major similarities and distinctions (1)
Data science and data analytics  major similarities and distinctions (1)Data science and data analytics  major similarities and distinctions (1)
Data science and data analytics major similarities and distinctions (1)Robert Smith
 
Regression and correlation
Regression and correlationRegression and correlation
Regression and correlationVrushaliSolanke
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleDr. Radhey Shyam
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityPrecisely
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfDr. Radhey Shyam
 
Uncover Trends and Patterns with Data Science.pdf
Uncover Trends and Patterns with Data Science.pdfUncover Trends and Patterns with Data Science.pdf
Uncover Trends and Patterns with Data Science.pdfUncodemy
 
Data Integrity Training by Dr. A. Amsavel
Data Integrity Training   by Dr. A. AmsavelData Integrity Training   by Dr. A. Amsavel
Data Integrity Training by Dr. A. AmsavelDr. Amsavel A
 
What could possibly go wrong? - An incomplete guide on how to prevent, detect...
What could possibly go wrong? - An incomplete guide on how to prevent, detect...What could possibly go wrong? - An incomplete guide on how to prevent, detect...
What could possibly go wrong? - An incomplete guide on how to prevent, detect...LeaPetters1
 
What is Data Science?
What is Data Science?What is Data Science?
What is Data Science?Ahmed Banafa
 

Similar to Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?) (20)

Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
week 7.pptx
week 7.pptxweek 7.pptx
week 7.pptx
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data science
 
Datascience.pptx
Datascience.pptxDatascience.pptx
Datascience.pptx
 
The Role of Community-Driven Data Curation for Enterprises
The Role of Community-Driven Data Curation for EnterprisesThe Role of Community-Driven Data Curation for Enterprises
Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?)

  • 1. Of Unicorns, Yetis, and Error-Free Datasets (or what is data quality?) EMIT BOCCONI 9/10/23 Gianluca Tarasconi, CDO, ipQuants AG gt@ipQuants.com
  • 2. About the author
      - Background in Engineering
      - 18 years in Bocconi research centers (CESPRI, ICRIOS…) as data architect
      - Consultant for EPO, USPTO, WIPO, OECD, World Bank, E&Y, MIUR, MIT, LSE …
      - Publications record: https://scholar.google.it/citations?user=UQQgBXEAAAAJ
      - Currently Chief Data Officer at ipQuants AG
  • 3. About www.ipQuants.com
      - Provides an AI-driven digital cockpit (Qthena) for patent attorneys
      - Statistics and powerful indicators on the different phases of the patenting process, plus competitive intelligence on law firms
  • 4. CDO main tasks
      1. Data Governance: Develop and enforce data policies, standards, and procedures to ensure data integrity and compliance.
      2. Data Strategy: Craft a strategic vision for data utilization, aligning it with organizational goals.
      3. Data Quality: Oversee data quality management, ensuring accurate and reliable data for decision-making.
      4. Data Privacy & Security: Protect sensitive data through robust security measures and compliance with data privacy regulations.
      5. Data Analytics: Drive data-driven decision-making by promoting data analytics and insights across the organization.
      6. Data Innovation: Foster a culture of innovation by exploring new data technologies and opportunities.
      7. Data Talent: Build and nurture a skilled data team to execute the data strategy effectively.
  • 5. Error-free data are a utopia
      “Utopia is on the horizon. I move two steps closer; it moves two steps further away. I walk another ten steps and the horizon runs ten steps further away. As much as I may walk, I'll never reach it. So what's the point of utopia? The point is this: to keep walking.” (Eduardo Galeano)
      Aiming for error-free data does not mean you will get them, but it gives you a direction for data management.
  • 6. A few unicorns exist…
      - A gold standard dataset is a meticulously curated benchmark collection of data that serves as the definitive reference for evaluating the accuracy and performance of algorithms, models, or systems. It represents the highest attainable quality and correctness in data, making it a trusted baseline for comparisons.
      - Gold standard datasets are often painstakingly annotated or labeled by domain experts to ensure their reliability.
      - They are used in various fields, including machine learning, natural language processing, and medical research, where precision and validation are crucial. Researchers and practitioners use gold standard datasets to measure and improve the quality and effectiveness of their data-driven solutions.
  • 7. But in everyday (data) life:
      - Have processes in place for the continuous improvement of data quality.
      - Avoid bias in data.
      - Respect existing laws and regulations.
      - Under European data-processing legislation (Regulation 2016/679), whoever processes personal data has a duty to verify that the data are lawful, relevant, up to date, and correct.
  • 8. What is a bias in data?
      - Bias in datasets refers to the presence of systematic and unfair inaccuracies, favoring specific groups, perspectives, or outcomes, often reflecting societal prejudices or limitations in data collection.
      - This bias can manifest in various ways, including underrepresentation or overrepresentation of certain demographics, cultural biases in language or image data, or skewed sampling methods.
      - Identifying and mitigating bias in datasets is crucial (know your enemy) for ensuring fairness, transparency, and equity in data-driven applications and decision systems. It requires careful data curation, diverse representation, and ongoing evaluation to address these challenges effectively.
  • 9. Data biases taxonomy (from https://www.statice.ai/post/data-bias-types)
      - Selection bias: Randomization is the process that balances out the effects of uncontrollable factors, that is, variables in a data set that are not specifically measured and can compromise results. In data science, selection bias occurs when the data are not properly randomized.
      - Overgeneralization bias: Overgeneralization occurs when a person applies something learned from one event to all future events. See Russell's inductivist turkey.
  • 10. Data biases taxonomy (II)
      - Reporting bias: A reporting bias is the inclusion of only a subset of results in an analysis, which typically covers only a small fraction of the evidence.
      - Example: a sentiment analysis model is trained to predict whether a book review on a popular website is positive or negative. The vast majority of reviews in the training data set reflect extreme opinions (reviewers who either adored or despised a book), because people are less likely to review a book they do not feel strongly about. As a result, the model is less likely to accurately predict the sentiment of reviews that use more subtle language to describe a book.
  • 11. Data biases taxonomy (III)
      - Group attribution bias: Group attribution biases refer to the human tendency to assume that an individual's characteristics are always determined by the beliefs of the group. The bias manifests itself when you give preference to your own group (in-group bias) or when you stereotype members of groups you don't belong to (out-group bias).
      - Example: when training a résumé-screening model for software developers, engineers might be predisposed to believe that applicants who attended the same school as they did are better qualified for the job.
      - Implicit bias: Implicit biases occur when we make assumptions based on our personal experiences. Implicit bias manifests itself as attitudes and stereotypes we hold about others, even when we are unaware of it.
  • 12. What is quality
      “Quality...you know what it is, yet you don’t know what it is. But that’s self-contradictory. But some things are better than others, that is, they have more quality. But when you try to say what the quality is, apart from the things that have it, it all goes poof! […] If no one knows what it is, then for all practical purposes it doesn’t exist at all. But for all practical purposes it really does exist.” (Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance)
  • 13. Data governance main principles
      - Data Quality: Maintain data accuracy, consistency, and reliability by setting and enforcing data quality standards.
      - Data Ownership: Clearly define roles and responsibilities for data stewardship and ownership throughout the organization.
      - Data Compliance: Ensure adherence to data privacy laws, industry regulations, and internal policies.
      - Data Documentation: Document metadata [1], data lineage [2], and data definitions to enhance data understanding and usability.
      - Data Lifecycle Management: Define processes for data creation, storage, archiving, and disposal in line with business needs.
      - Data Auditing and Monitoring: Implement access controls and permissions. Establish monitoring mechanisms to detect and address data issues, anomalies, and breaches.
      - Continuous Improvement: Continuously assess and refine data governance processes to adapt to changing business needs and technological advancements.
      - Data Ethics: Consider ethical implications in data usage, ensuring that data-driven decisions align with ethical standards.
      [1] Metadata is the information that describes and explains data.
      [2] Data lineage includes the data's origin, what happens to it, and where it moves over time.
  • 14. Know your enemy…
      Data profiling is the process of examining, analyzing, and creating useful summaries of data:
      - Collecting descriptive statistics such as min, max, count, and sum.
      - Collecting data types, lengths, and recurring patterns.
      - Tagging data with keywords, descriptions, or categories.
      - Performing data quality assessment, including the risk of performing joins on the data.
      - Discovering metadata and assessing its accuracy.
  • 15. Python libraries for data profiling
      Data profiling is the process of examining and analyzing data to gain insights into its structure, quality, completeness, and other characteristics.
      - ydata_profiling: a package for data profiling that automates and standardizes the generation of detailed reports, complete with statistics and visualizations.
      - Lux: a Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process. When a dataframe is printed in a Jupyter notebook, Lux recommends a set of visualizations highlighting interesting trends and patterns in the dataset.
      - DataProfiler: a Python library designed to make data analysis, monitoring, and sensitive data detection easy. With a single command, it automatically formats and loads files into a DataFrame. When profiling the data, it identifies the schema, statistics, entities (PII / NPI), and more. Data profiles can then be used in downstream applications or reports.
  • 16. Example of data profiling: data profiling using ydata_profiling
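The profiling steps listed in slide 14 (descriptive statistics, data types, recurring patterns) can be sketched with the standard library alone. This is a hand-rolled stand-in for illustration, not the report that ydata_profiling produces; the function name `profile_column` and the sample column (patent numbers borrowed from the later example tables, with one value blanked out) are assumptions.

```python
from collections import Counter
import re


def profile_column(values):
    """Summarize one column: counts, min/max, inferred types, value patterns.

    Assumes the non-missing values are mutually comparable (e.g. all strings).
    """
    present = [v for v in values if v not in (None, "")]

    def pattern(value):
        # Letters become 'A' and digits become '9' to expose recurring shapes.
        return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", str(value)))

    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "types": Counter(type(v).__name__ for v in present),
        "patterns": Counter(pattern(v) for v in present),
    }


# Hypothetical sample column: one value is missing on purpose.
patent_numbers = ["US1234567A", "EP7890123B", "JP4567890C", None, "US3456789B"]
summary = profile_column(patent_numbers)
print(summary["missing"], summary["patterns"].most_common(1))
```

A single dominant pattern suggests a consistent identifier format; several competing patterns would be a hint to look at the consistency dimension.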
  • 17. Data Quality Dimensions
      - Accuracy
      - Completeness
      - Consistency
      - Timeliness
      - Relevance
      - Validity
      - Uniqueness
  • 18. Accuracy
      Accuracy refers to the degree to which data accurately represents the real-world entity or event it is meant to describe. It involves minimizing errors and discrepancies in data.
      - Ensuring data values are correct and precise.
      - Eliminating data entry mistakes and inaccuracies.
      - Validating data against reliable sources to maintain accuracy.
  • 19. Examples on patent data
      A patent dataset typically contains a comprehensive collection of information related to patents examined by a particular patent office or jurisdiction. It serves as a resource for various purposes, including research, innovation, and intellectual property management. Typical content of a patent dataset:
      - Patent Identification: A unique patent identifier, such as a patent number, that distinguishes each patent record within the dataset.
      - Inventor Information: Details about the inventors, including their names, affiliations, and sometimes their contact information.
      - Patent Title: A concise title that describes the invention or technology covered by the patent.
      - Abstract: A brief summary of the patent's contents, providing an overview of the patented innovation.
      - Patent Claims: The claims section outlines the specific legal protections and rights granted by the patent. It defines the boundaries of the invention.
      - Filing and Grant Dates: The dates when the patent application was filed and when the patent was granted, providing a timeline for the patent's life.
      - Assignee Information: Details about the entity or entities to whom the patent rights have been assigned or who hold ownership of the patent.
      - Citations: Information about prior art or related patents that influenced or were cited by the patent in question. This can provide insights into the innovation's context.
      - Technology Classification: Patents are often categorized into specific technology classes or subclasses based on their subject matter. These classifications help in organizing and searching for patents.
      - Legal Status: Information about the current legal status of the patent, such as whether it is active, expired, or abandoned.
  • 20. A patent document
      - Patent Identification: Publication number US D504889; application number 29/201636.
      - Inventor Information: Steve Jobs and others, with address.
      - Patent Title: ELECTRONIC DEVICE
      - Patent Claims: Ornamental design of an electronic device.
      - Filing and Grant Dates: filed 17/3/2004, granted 10/5/2005.
      - Citations: D 345346
      - Assignee Information: Apple.
      - Technology Classification: D 14/374 ….
      - Legal Status: Granted
  • 21. Are patent data biased?
      - Data collection bias: data are collected for examiners' use, so fields such as addresses and names are not standardized, or not collected at all.
      - Representation bias: inventors are not representative of the whole population.
      - Overgeneralization bias: innovation proceeds through disruptive changes, so predictions based on patent data may suffer from this bias.
      - Know your bias…
  • 22. Inaccurate Patent Records
      Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
      US1234567A | John Smith | 15/05/2020 | Automated Widget | A new type of smartphone.
      EP7890123B | Mary Johnson | 20/08/2019 | Self-Driving Car Tech | An advanced toothbrush design.
      JP4567890C | Robert Brown | 10/06/2021 | Improved Gadget | Revolutionary vaccine for a rare disease.
      CN2345678U | Jane Doe | 10/12/2021 | Mobile App Interface | A recipe for homemade cookies.
      US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | High-speed internet router.
      1. Patent Title: The patent titles do not accurately describe the inventions. For example, the patent with the title "Automated Widget" actually describes a smartphone, and "Self-Driving Car Tech" is associated with an advanced toothbrush.
      2. Abstract: The abstracts are entirely inaccurate and do not match the actual content of the patents. For instance, a vaccine patent has an abstract about a revolutionary vaccine but is titled "Improved Gadget."
      3. Inventor Name: The names of inventors are present but may not match the actual inventors of the described inventions. For instance, an inventor may be attributed to an invention they did not create.
      4. Patent Number: The patent numbers are correctly formatted, but the information associated with them is inaccurate, making it difficult to trust the dataset for research or intellectual property purposes.
      5. Filing Date: The filing dates appear to be accurate, but they are not linked to the inventions' descriptions, further contributing to data inaccuracy.
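Validating data against a reliable source, as the accuracy slide recommends, can be sketched as a field-by-field comparison. The helper `find_mismatches` is hypothetical, and the `reference` values below are invented purely for illustration; in practice they would come from a trusted register.

```python
def find_mismatches(records, reference, fields):
    """Return (key, field, observed, expected) for each value that disagrees
    with the trusted reference. Records absent from the reference are skipped."""
    mismatches = []
    for key, record in records.items():
        trusted = reference.get(key)
        if trusted is None:
            continue  # nothing trustworthy to compare against
        for field in fields:
            if record.get(field) != trusted.get(field):
                mismatches.append(
                    (key, field, record.get(field), trusted.get(field))
                )
    return mismatches


# Two rows taken from the example table on the slide.
records = {
    "US1234567A": {"title": "Automated Widget", "inventor": "John Smith"},
    "EP7890123B": {"title": "Self-Driving Car Tech", "inventor": "Mary Johnson"},
}
# Hypothetical trusted source; these values are assumptions, not real data.
reference = {
    "US1234567A": {"title": "Smartphone", "inventor": "John Smith"},
    "EP7890123B": {"title": "Self-Driving Car Tech", "inventor": "Mary Johnson"},
}

issues = find_mismatches(records, reference, ["title", "inventor"])
for issue in issues:
    print(issue)
```

Exact equality is a deliberately strict choice; real name and title matching usually needs normalization first.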
  • 23. Completeness
      Completeness assesses whether all required data elements are present in a dataset. Incomplete data can lead to biased or incorrect analyses.
      - Verifying that all necessary fields are populated.
      - Handling missing data through imputation or data collection.
      - Ensuring data records are not missing critical information.
  • 24. Incomplete Patents dataset
      Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
      US1234567A | John Smith | 15/05/2020 | Automated Widget | (empty)
      EP7890123B | Mary Johnson | 20/08/2019 | (empty) | (empty)
      JP4567890C | Robert Brown | (empty) | Improved Gadget | (empty)
      CN2345678U | (empty) | 10/12/2021 | Mobile App Interface | (empty)
      US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | (empty)
      - Patent Title: Several patents are missing titles, making it challenging to understand the subject matter of these inventions.
      - Abstract: None of the patents have abstracts, which are typically brief summaries of the inventions. Without abstracts, it's difficult to get a quick overview of what each patent covers.
      - Filing Date: The filing date is missing for one patent and the grant date is missing for all of them, which is essential for understanding the patent's lifecycle and history.
      - Inventor Name: Some inventor names are missing, making it unclear who the inventors of these patents are.
      - Patent Number: While patent numbers are present, they are incomplete, and they lack any standardized format, which can make it challenging to cross-reference or uniquely identify the patents.
      Are there missing records? This is the most difficult error to spot.
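Field-level completeness can be measured as the share of populated values per column. The sketch below is a minimal illustration; the placement of the empty cells follows one reading of the flattened slide table and is therefore an assumption, as are the field names.

```python
FIELDS = ["number", "inventor", "filing_date", "title", "abstract"]

# None marks an empty cell (assumed reading of the slide's table).
rows = [
    {"number": "US1234567A", "inventor": "John Smith",
     "filing_date": "15/05/2020", "title": "Automated Widget", "abstract": None},
    {"number": "EP7890123B", "inventor": "Mary Johnson",
     "filing_date": "20/08/2019", "title": None, "abstract": None},
    {"number": "JP4567890C", "inventor": "Robert Brown",
     "filing_date": None, "title": "Improved Gadget", "abstract": None},
    {"number": "CN2345678U", "inventor": None,
     "filing_date": "10/12/2021", "title": "Mobile App Interface", "abstract": None},
    {"number": "US3456789B", "inventor": "Samantha Lee",
     "filing_date": "25/03/2022", "title": "Smart Device Control", "abstract": None},
]


def completeness(rows, fields):
    """Fraction of rows with a non-empty value, per field."""
    if not rows:
        return {field: 0.0 for field in fields}
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) not in (None, "")) / total
        for field in fields
    }


scores = completeness(rows, FIELDS)
for field, score in scores.items():
    print(f"{field}: {score:.0%}")
```

This only measures missing values inside the records that exist. Missing records, which the slide calls the most difficult error to spot, need an external count to compare against.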
  • 25. Consistency
      Consistency involves maintaining uniformity and coherence in data across different sources, formats, and time periods.
      - Checking for consistency in data formats and units of measurement.
      - Resolving discrepancies in data values between sources.
      - Establishing data standards to ensure consistency.
  • 26. Inconsistent Patent Records
      Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
      US1234567A | John Smith | 2020.15.03 | Automated Widget | An automated widget for various applications.
      EP7890123B | Mary Johnson | (empty) | Widget Automation | Automated widget for a range of purposes.
      JP4567890C | R Brown | 2021 | Improved Gadget | An enhancement for gadgets.
      CN2345678U | (empty) | 10/12/2021 | Mobile App Interface | Mobile application interface.
      US3456789B | Lee, Samantha | 03/25/2022 | Smart Device Control System | Control system for smart devices.
      1. Patent Title: The patent titles lack consistency. For example, similar inventions have different titles (e.g., "Automated Widget" vs. "Widget Automation" for the same concept).
      2. Abstract: The abstracts also vary in terms of detail and clarity. Some are concise and informative, while others provide vague or incomplete descriptions.
      3. Inventor Name: One patent record has no inventor name specified, and there is a mix of formats for inventor names (full names vs. partial names).
      4. Patent Number: The patent numbers are consistent in format but may not follow a standardized naming convention, which can hinder cross-referencing or matching with external databases.
      5. Filing Date: The filing dates are in different date formats, and some are missing altogether, making it challenging to establish a clear timeline.
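One way to resolve the mixed date formats described above is to try a list of known formats and emit ISO 8601 dates, flagging anything that cannot be parsed. This is a sketch under assumptions: the format list and its order (day-first preferred) are choices made for this example, and a value such as 10/12/2021 remains genuinely ambiguous.

```python
from datetime import datetime

# Candidate formats, tried in order. Preferring day-first is an assumption;
# ambiguous values like "10/12/2021" will be read as 10 December.
KNOWN_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y.%d.%m"]


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if missing or unparseable."""
    if not raw:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


# Filing dates as they appear in the slide's table ("" marks the empty cell).
filing_dates = ["2020.15.03", "", "2021", "10/12/2021", "03/25/2022"]
normalized = [to_iso_date(value) for value in filing_dates]
print(normalized)
```

Returning `None` for the year-only value keeps the incomplete date visible instead of silently inventing a day and month.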
  • 27. Timeliness
    Timeliness measures whether data is available when needed and up to date for its intended purpose.
    - Setting data refresh intervals to meet business needs.
    - Monitoring data pipelines for delays or bottlenecks.
    - Ensuring data aligns with the time frame of analyses or decisions.
    - Synchronization of data sources.
  • 28. Patent Records with Timeliness Issues
    1. Filing Date: All the patents in the dataset have relatively old filing dates, ranging from 2010 to 2015. This indicates that the dataset is not up to date and may not include recent inventions or innovations.
    2. Patent Titles and Abstracts: The patent titles and abstracts describe technologies and inventions that were relevant and cutting-edge at the time of filing, but they may not accurately represent the current state of technology.
    3. Applicants: Some applicants may have changed name (Google vs Alphabet), been acquired (USR is now a division of UNICOM Global), or closed; the patent itself may even have been sold to others.
    Patent Number | Patent Applicant | Filing Date | Patent Title | Abstract
    US1234567A | University of Columbia | 15/08/2010 | Automated Widget | An innovative widget design.
    EP7890123B | US Robotics | 20/06/2012 | Advanced Robotics | Cutting-edge robotics technology.
    JP4567890C | Disney | 10/09/2013 | Improved Gadget | A more efficient gadget solution.
    CN2345678U | Google | 10/12/2014 | Mobile App Interface | User-friendly mobile app interface.
    US3456789B | Samantha Lee | 25/03/2015 | Smart Device Control System | High-speed control system for devices.
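A staleness check for the filing-date issue can be as simple as comparing the newest record with a reference date. This is a rough sketch under assumptions: the reference date and the one-year threshold are arbitrary illustrative choices, and the helper names are hypothetical. It says nothing about the applicant-name drift in point 3, which needs an external source to detect.

```python
from datetime import date, datetime

# Filing dates from the slide's table, in DD/MM/YYYY.
FILING_DATES = ["15/08/2010", "20/06/2012", "10/09/2013", "10/12/2014", "25/03/2015"]


def newest_record_age_days(filing_dates, as_of):
    """Return the age in days of the most recent filing date, measured at `as_of`."""
    newest = max(datetime.strptime(d, "%d/%m/%Y").date() for d in filing_dates)
    return (as_of - newest).days


def is_stale(filing_dates, as_of, max_age_days):
    """True when even the newest record is older than the allowed age."""
    return newest_record_age_days(filing_dates, as_of) > max_age_days


AS_OF = date(2023, 10, 9)   # arbitrary reference date (assumption)
MAX_AGE_DAYS = 365          # hypothetical freshness requirement

age = newest_record_age_days(FILING_DATES, AS_OF)
print(f"newest record is {age} days old; stale: {is_stale(FILING_DATES, AS_OF, MAX_AGE_DAYS)}")
```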
  • 29. Relevance
    Relevance assesses whether the data collected is pertinent and useful for the intended analysis or decision-making.
    - Identifying and eliminating irrelevant data.
    - Tailoring data collection efforts to align with specific goals.
    - Regularly reviewing data to ensure it remains relevant over time.
  • 30. Patent Records with Lack of Relevance (context: looking for recent stationery innovations)
    1. Patent Titles: The patent titles describe inventions such as paperclips, staplers, pencil sharpeners, coffee mugs, and smart device control apps. Some of them fall outside our target.
    2. Abstracts: The abstracts provide more detailed information about these inventions, but they still do not demonstrate a clear connection to cutting-edge or industry-relevant technologies.
    3. Filing Dates: While some patents are relatively recent (filed in 2021 or 2022), the inventions from 2012 and 2015 are not relevant.
    Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
    US1234567A | John Smith | 15/05/2015 | Improved Paperclip | A new design for paperclips.
    EP7890123B | Mary Johnson | 20/08/2012 | Enhanced Stapler | A stapler with improved ergonomics.
    JP4567890C | Robert Brown | 10/06/2021 | Advanced Pencil Sharpener | Cutting-edge pencil sharpener with safety features.
    CN2345678U | Jane Doe | 10/12/2021 | Innovative Coffee Mug | A coffee mug with a built-in flashlight.
    US3456789B | Samantha Lee | 25/03/2022 | Smart Device Control | An app for controlling smart home devices.
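For the stated context (recent stationery innovations), relevance combines a topic test with a recency test. The sketch below is only an illustration: the keyword list, the cutoff year, and the `is_relevant` helper are hypothetical stand-ins for whatever scope the analysis really defines, and a production filter would more likely rely on patent classification codes than on title keywords.

```python
from datetime import datetime

# (patent number, filing date DD/MM/YYYY, title) transcribed from the slide.
ROWS = [
    ("US1234567A", "15/05/2015", "Improved Paperclip"),
    ("EP7890123B", "20/08/2012", "Enhanced Stapler"),
    ("JP4567890C", "10/06/2021", "Advanced Pencil Sharpener"),
    ("CN2345678U", "10/12/2021", "Innovative Coffee Mug"),
    ("US3456789B", "25/03/2022", "Smart Device Control"),
]

TARGET_KEYWORDS = ("paperclip", "stapler", "pencil")  # hypothetical topic scope
EARLIEST_YEAR = 2020                                  # hypothetical meaning of "recent"


def is_relevant(filing_date, title):
    """True when the title is on topic and the filing year is recent enough."""
    year = datetime.strptime(filing_date, "%d/%m/%Y").year
    on_topic = any(keyword in title.lower() for keyword in TARGET_KEYWORDS)
    return on_topic and year >= EARLIEST_YEAR


relevant = [number for number, filed, title in ROWS if is_relevant(filed, title)]
print(relevant)
```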
  • 31. Validity
    Validity ensures that data conforms to defined rules and constraints, preventing incorrect or nonsensical values.
    - Implementing data validation checks to flag invalid entries.
    - Defining and enforcing data constraints.
    - Regularly auditing data for validity against predefined criteria.
  • 32. Patent Records with Lack of Validity
    Patent Number | Inventor Name | Filing Date | Patent Title | References
    US1234567A | John Smith | 20201027 | Time Travel Device | JP20011347A,
    EP7890123 | Mary Johnson | 20.08.2019 | Perpetual Motion Machine | A perpetual motion machine, a concept debunked by physics.
    JP-4567890-C | Robert Brown | 10/28/2021 | Anti-Gravity Shoes | [JP20011347A, US20167654320A]
    CN2345678U | Jane Doe | 10/12/2021 | Immortality Elixir | .
    US3456789B1 | Samantha Lee | 25/03/2022 | Mind-Reading Headset | J J Cale - Magnolia
    1. Patent Number: The numbers follow different standards, and some of them are incomplete.
    2. Filing Date: Different formats are used.
    3. References: They are inconsistent both in format (comma-separated list of patents vs. list within []) and in content (title vs. number vs. author and title…).
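Validity checks of this kind are usually expressed as explicit rules that each value either passes or fails. The sketch below assumes two illustrative house rules that are not taken from the slides or from any official standard: a patent number is a two-letter country code, at least six digits, and a kind code (a letter with an optional digit); a filing date is written DD/MM/YYYY.

```python
import re
from datetime import datetime

# Hypothetical house rules, for illustration only.
PATENT_NUMBER_RULE = re.compile(r"[A-Z]{2}\d{6,}[A-Z]\d?")
DATE_RULE = "%d/%m/%Y"


def valid_patent_number(value):
    """True when the whole value matches the patent number rule."""
    return PATENT_NUMBER_RULE.fullmatch(value) is not None


def valid_filing_date(value):
    """True when the value parses as a real calendar date in DD/MM/YYYY."""
    try:
        datetime.strptime(value, DATE_RULE)
        return True
    except ValueError:
        return False


ROWS = [
    ("US1234567A", "20201027"),
    ("EP7890123", "20.08.2019"),
    ("JP-4567890-C", "10/28/2021"),
    ("CN2345678U", "10/12/2021"),
    ("US3456789B1", "25/03/2022"),
]

for number, filed in ROWS:
    print(number, valid_patent_number(number), filed, valid_filing_date(filed))
```

Under these rules only CN2345678U and US3456789B1 pass both checks; the other rows fail on the number, the date, or both. A rule-based check flags format violations, but it cannot judge whether a syntactically valid value is true.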
  • 33. Uniqueness
    Uniqueness measures whether each data entry is distinct and not duplicated within the dataset.
    - Identifying and removing duplicate records.
    - Implementing data deduplication processes.
    - Ensuring data uniqueness to avoid overcounting or incorrect analyses.
  • 34. Patent Records with Lack of Uniqueness
    1. Patent Titles: The patent titles may not be unique (see lines 1 and 3); this is an issue only if the title is used as a key field.
    2. Patent Number: Lines 2 and 5 have the same number but refer to totally different patents.
    3. Inventor: We cannot be sure whether the John Smith of the patent in line 1 is the same person as the one in line 5 (entity disambiguation needed).
    Patent Number | Inventor Name | Filing Date | Patent Title | Abstract
    US1234567A | John Smith | 15/05/2020 | Improved Light Bulb | An incremental improvement to existing light bulb tech.
    US3456789B | Mary Johnson | 20/08/2019 | Enhanced Umbrella | An umbrella design with minor ergonomic enhancements.
    JP4567890C | Robert Brown | 10/06/2021 | Improved Light Bulb | A spoon design featuring minor ergonomic changes.
    CN2345678U | Jane Doe | 10/12/2021 | Redesigned Pen | A pen design with slight modifications.
    US3456789B | John Smith | 25/03/2022 | Smart Toaster | A toaster with minor technological upgrades.
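The three observations on this slide all reduce to the same question: which values occur on more than one line of a given column? A minimal sketch, with the rows transcribed from the table and a hypothetical `repeated_values` helper:

```python
from collections import defaultdict

# (patent number, inventor name, patent title) transcribed from the slide.
ROWS = [
    ("US1234567A", "John Smith", "Improved Light Bulb"),
    ("US3456789B", "Mary Johnson", "Enhanced Umbrella"),
    ("JP4567890C", "Robert Brown", "Improved Light Bulb"),
    ("CN2345678U", "Jane Doe", "Redesigned Pen"),
    ("US3456789B", "John Smith", "Smart Toaster"),
]


def repeated_values(rows, column):
    """Map each value occurring more than once in `column` to its 1-based line numbers."""
    seen = defaultdict(list)
    for line_no, row in enumerate(rows, start=1):
        seen[row[column]].append(line_no)
    return {value: lines for value, lines in seen.items() if len(lines) > 1}


print("numbers:  ", repeated_values(ROWS, 0))
print("inventors:", repeated_values(ROWS, 1))
print("titles:   ", repeated_values(ROWS, 2))
```

A repeated patent number is a key violation that must be resolved; a repeated inventor name is only a candidate match, since exact string equality neither proves nor rules out that two records refer to the same person.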
  • 35. Data Documentation
    - Metadata Standardization: Standardize metadata formats and definitions to ensure consistency and clarity across data assets.
    - Data Classification: Categorize data assets based on their type, sensitivity, and purpose to aid in search and access.
    - Data Descriptions: Provide detailed descriptions of each data asset, including its source, usage, and relevance to the organization.
    - Data Lineage: Document the data's journey from source to consumption, showing how it's transformed and used.
  • 36. Data Documentation (II)
    - Keyword Tags: Attach relevant keywords and tags to data assets for improved discoverability.
    - Data Dependencies: Highlight relationships and dependencies between different data assets to provide context.
    - Versioning: Maintain version history for data assets to track changes and updates over time.
    - Data Ownership: Clearly indicate data ownership and stewardship responsibilities for each asset.
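The documentation items on these two slides can be captured as one structured record per data asset, which also makes missing documentation checkable. This is a sketch under assumptions: the `DataAssetRecord` class, its field names, and the example values are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DataAssetRecord:
    """One documentation entry per data asset (hypothetical field set)."""
    name: str
    description: str = ""      # data description
    source: str = ""           # where the data comes from
    owner: str = ""            # data ownership / stewardship
    classification: str = ""   # type, sensitivity, purpose
    version: str = ""          # versioning
    tags: list = field(default_factory=list)        # keyword tags
    depends_on: list = field(default_factory=list)  # data dependencies
    lineage: list = field(default_factory=list)     # transformation steps


def undocumented_fields(record):
    """Return the names of the documentation fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative values only.
record = DataAssetRecord(
    name="patent_applications",
    description="One row per patent application.",
    source="EPO PatStat",
    tags=["patents"],
)
gaps = undocumented_fields(record)
print(gaps)
```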
  • 37. Logical Model (E-R Diagram) (relational DBs only)
    - EPO PatStat database
    - Tables naming convention: TLS999 + content
    - Main key across database: appln_id (unique patent application id)
  • 38. Data Ontologies
    - Data ontologies are structured frameworks that define and organize the concepts, entities, and relationships within a specific domain of knowledge. They play a crucial role in the field of data science and knowledge management. Here are a few lines to explain their significance:
    - "Data ontologies serve as the building blocks of knowledge representation in data science. They provide a structured way to define and categorize the various elements within a particular domain.
    - By formalizing the relationships between entities and attributes, data ontologies enable more effective data integration, semantic reasoning, and data sharing among different systems and disciplines. They act as a common language that bridges the gap between human understanding and machine processing, facilitating better data interoperability and knowledge discovery."
  • 39. FAIR data principles
    - https://www.go-fair.org/fair-principles/
    - Findability, Accessibility, Interoperability, and Reuse of digital assets
    - Findable: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.
  • 40. FAIR data principles (II)
    - Accessible: Once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation.
    - Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
    - Reusable: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
  • 42. Conclusions
    “Each machine has its own, unique personality which probably could be defined as the intuitive sum total of everything you know and feel about it. This personality constantly changes, usually for the worse, but sometimes surprisingly for the better, and it is this personality that is the real object of motorcycle maintenance.”
    Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance