The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data science pipelines
1. Paolo Missier
School of Computing
Newcastle University, UK
IPAW @Provenance Week, July 2021
3. 3
A little bibliometrics
Database: Web of Science Core Collection
TI = ((data or workflow) and provenance)
OR
AB = ((data or workflow) and provenance)
2000-today
>3,000 records
4. 4
Caveat: WoS vs Scopus
A similar query returns about 12,000 records from Scopus:
TITLE-ABS-KEY ( ( data OR workflow ) AND provenance ) AND PUBYEAR > 2000
4,500 after refinement by subject area:
TITLE-ABS-KEY ( ( data OR workflow ) AND provenance ) AND PUBYEAR > 2000
AND ( LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "MATH" )
OR LIMIT-TO ( SUBJAREA , "ENGI" ) )
7. 7
Highly cited
940
"the pre-tectonic monzogranitic gneisses of the Liaoji granitoids or similar-aged granitoids may have been an important component of
the provenance for the Liaohe Group."
9. 9
Focus on our own community
Database: Web of Science Core Collection
TI = ((data or workflow) and provenance)
OR
AB = ((data or workflow) and provenance)
2000-today
Query Web of Science
(could have used Scopus)
>3,000 records
Refine by WoS categories
1,500 records
Restrict to 2020-21
120 records
Refined by: WEB OF SCIENCE CATEGORIES: (COMPUTER SCIENCE THEORY METHODS OR COMPUTER SCIENCE INFORMATION SYSTEMS OR
COMPUTER SCIENCE SOFTWARE ENGINEERING OR ENGINEERING ELECTRICAL ELECTRONIC OR COMPUTER SCIENCE INTERDISCIPLINARY
APPLICATIONS OR GEOSCIENCES MULTIDISCIPLINARY OR TELECOMMUNICATIONS OR COMPUTER SCIENCE ARTIFICIAL INTELLIGENCE OR
COMPUTER SCIENCE HARDWARE ARCHITECTURE OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MEDICAL INFORMATICS)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (REMOTE SENSING OR GEOGRAPHY PHYSICAL OR IMAGING SCIENCE PHOTOGRAPHIC
TECHNOLOGY OR ARCHAEOLOGY OR ASTRONOMY ASTROPHYSICS OR BIOCHEMICAL RESEARCH METHODS OR GEOCHEMISTRY GEOPHYSICS
OR ANTHROPOLOGY OR AUTOMATION CONTROL SYSTEMS OR OPERATIONS RESEARCH MANAGEMENT SCIENCE OR COMPUTER SCIENCE
CYBERNETICS OR INFORMATION SCIENCE LIBRARY SCIENCE OR ENGINEERING BIOMEDICAL OR ENGINEERING MULTIDISCIPLINARY)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (HEALTH CARE SCIENCES SERVICES OR MATERIALS SCIENCE MULTIDISCIPLINARY OR
AGRICULTURE MULTIDISCIPLINARY OR CHEMISTRY MULTIDISCIPLINARY OR ECOLOGY OR EDUCATION SCIENTIFIC DISCIPLINES OR
ENGINEERING CHEMICAL OR ENGINEERING ENVIRONMENTAL OR ENVIRONMENTAL SCIENCES OR GREEN SUSTAINABLE SCIENCE
TECHNOLOGY OR NEUROSCIENCES OR OPTICS OR POLITICAL SCIENCE)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (ENERGY FUELS OR RADIOLOGY NUCLEAR MEDICINE MEDICAL IMAGING OR ROBOTICS)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (GEOSCIENCES MULTIDISCIPLINARY OR MINERALOGY OR ENGINEERING GEOLOGICAL OR
CHEMISTRY PHYSICAL OR ENGINEERING PETROLEUM OR INSTRUMENTS INSTRUMENTATION OR EVOLUTIONARY BIOLOGY OR
CHEMISTRY ANALYTICAL OR HEALTH POLICY SERVICES OR ENGINEERING CIVIL OR HISTORY PHILOSOPHY OF SCIENCE OR LOGIC OR
INTERNATIONAL RELATIONS OR MARINE FRESHWATER BIOLOGY OR ACOUSTICS OR MULTIDISCIPLINARY SCIENCES OR MINING MINERAL
PROCESSING OR OCEANOGRAPHY OR PHYSICS APPLIED OR MUSIC OR BIOLOGY OR NANOSCIENCE NANOTECHNOLOGY OR GEOLOGY OR
ENGINEERING INDUSTRIAL OR PHARMACOLOGY PHARMACY OR PALEONTOLOGY OR ENGINEERING MANUFACTURING OR PHYSICS
MULTIDISCIPLINARY OR WATER RESOURCES OR ENGINEERING MARINE OR PUBLIC ADMINISTRATION OR MEDICAL INFORMATICS OR
HUMANITIES MULTIDISCIPLINARY OR PUBLIC ENVIRONMENTAL OCCUPATIONAL HEALTH OR SOCIAL SCIENCES MATHEMATICAL METHODS
OR METEOROLOGY ATMOSPHERIC SCIENCES OR BUSINESS OR TRANSPORTATION SCIENCE TECHNOLOGY OR SOIL SCIENCE OR
CONSTRUCTION BUILDING TECHNOLOGY)
14. 16
Thematic evolution – abstracts (bi-grams)
M.J. Cobo, A.G. López-Herrera, E. Herrera-Viedma, F. Herrera, An approach for detecting, quantifying, and visualizing the evolution of a research field: A practical
application to the Fuzzy Sets Theory field, Journal of Informetrics, (5),1, 2011, https://doi.org/10.1016/j.joi.2010.10.002.
23. 25
Blockchain and provenance – recent papers
Ruan, P., Dinh, T.T.A., Lin, Q. et al. LineageChain: a fine-grained, secure and efficient data provenance system for
blockchains. The VLDB Journal 30, 3–24 (2021). https://doi.org/10.1007/s00778-020-00646-1
we identify and motivate a new class of smart contracts that rely on
provenance information at runtime.
LineageChain exposes lineage information to smart contracts runtime via
interfaces that support provenance-dependent
contracts. LineageChain captures provenance during contract execution […]
Pinna, Andrea, Tonelli, Roberto, Marchesi, Michele, Ibba, Simona, and Baralla, Gavina. "Ensuring Transparency and
Traceability of Food Local Products: A Blockchain Application to a Smart Tourism Region." Concurrency and
Computation: Practice and Experience 33.1 (2021). Web.
Casey, Eoghan, Bourquenoud, Jonathan, and Jaquet-Chiffelle, David-Olivier. "Tamperproof Timestamped Provenance
Ledger Using Blockchain Technology." Forensic Science International: Digital Investigation 33 (2020): 300977. Web.
A. Musamih et al., "A Blockchain-Based Approach for Drug Traceability in Healthcare Supply Chain," in IEEE Access, vol.
9, pp. 9728-9743, 2021, doi: 10.1109/ACCESS.2021.3049920.
Bai, B, Nazir, S, Bai, Y, Anees, A. Security and provenance for Internet of Health Things: A systematic literature review. J
Softw Evol Proc. 2021; 33:e2335. https://doi.org/10.1002/smr.2335
Porkodi, S., Kesavaraja, D. Secure Data Provenance in Internet of Things using Hybrid Attribute based Crypt
Technique. Wireless Pers Commun 118, 2821–2842 (2021). https://doi.org/10.1007/s11277-021-08157-0
26. 28
PROV @ UK REF 2021: NASA
NASA/ USGCRP (US Global Change Research Program)
US Global Change Information System (GCIS) https://data.globalchange.gov/. (Tilmes, Sherman)
PROV is used in the GCIS to enforce the traceability of all of the about 50,000 individual resources held
in the database […]
- Changes in working practice & policy.
- Effect on policy debate provided by transparency and assurance
27. 29
PROV @ UK REF 2021: UK National Archives
UK National Archives (Cresswell)
- Change of working practice as a result of the requirement by the National Archives (NA) to include
provenance. All of Gazette data must now be supported by provenance statements.
- Traceability of legislation data
28. 30
PROV @ UK REF 2021: AstraZeneca
AstraZeneca (Plasterer)
The process of adopting PROV along with other ontologies started in 2013 as part of a million-dollar
project, where PROV is estimated to account for about 5-10%, with continued maintenance to date.
- change of working practices, where the use of shared vocabularies now informs data governance
and promotes transparency
- competitive advantage […] CI360 technology is based on “nanopublications” (http://nanopub.org/)
29. 31
PROV @ UK REF 2021: others
https://blogs.ncl.ac.uk/paolomissier/2021/02/07/w3c-prov-some-interesting-extensions-to-the-core-standard/
30. 32
Part III:
Data Provenance for Data Science (DP4DS)
In collaboration with:
Prof. Torlone, Giulia Simonelli, Luca Lauro – Università Roma Tre, Italy
Prof. Chapman -- University of Southampton, UK
Adriane Chapman, Paolo Missier, Giulia Simonelli, and Riccardo Torlone. 2020. Capturing and querying fine-grained
provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520.
DOI:https://doi.org/10.14778/3436905.3436911
31. 33
Traceability, explainability, transparency – EU regulations
“Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing!
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events
(‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or
common specifications.
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
“AI systems that create a high risk to the health and safety or fundamental rights of natural persons […] the
classification as high-risk does not only depend on the function performed by the AI system, but also on the specific
purpose and modalities for which that system is used.”
- used for the purpose of assessing students
- recruitment or selection of natural persons
- evaluate the eligibility of natural persons for public assistance benefits and services
- evaluate the creditworthiness of natural persons or establish their credit score
- used by law enforcement authorities for making individual risk assessments
32. 34
Can provenance help address the new EU regulations?
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
Article 12 Record-keeping
2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that
is appropriate to the intended purpose of the system.
3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect
to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or
lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61.
4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a
minimum:
(a) recording of the period of each use of the system (start date and time and end date and time of each use);
(b) the reference database against which input data has been checked by the system;
(c) the input data for which the search has led to a match;
(d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
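The minimum logging content listed in points (a) to (d) can be pictured as a small record type. The following is only an illustrative sketch: the class name, field names and example values are hypothetical and are not taken from the regulation or from any logging standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class UsageLogRecord:
    """One record per use of the system; all field names are hypothetical."""
    start: datetime                 # (a) start date and time of the use
    end: datetime                   # (a) end date and time of the use
    reference_database: str         # (b) database the input was checked against
    matched_inputs: list = field(default_factory=list)  # (c) inputs that led to a match
    verified_by: list = field(default_factory=list)     # (d) persons verifying the results

    def to_json(self) -> str:
        # Serialise timestamps as ISO 8601 strings so the log is portable.
        record = asdict(self)
        record["start"] = self.start.isoformat()
        record["end"] = self.end.isoformat()
        return json.dumps(record, sort_keys=True)


# Made-up example values, for illustration only.
record = UsageLogRecord(
    start=datetime(2021, 7, 1, 9, 0, tzinfo=timezone.utc),
    end=datetime(2021, 7, 1, 9, 5, tzinfo=timezone.utc),
    reference_database="reference-db-v1",
    matched_inputs=["input-0042"],
    verified_by=["reviewer-a"],
)
print(record.to_json())
```

A provenance trace would carry more than this flat record (for instance, how the input data was derived), but the four fields map directly onto the four points of paragraph 4.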
33. 35
[Figure: Pipeline structure with provenance annotations]
Pipeline stages: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split (Training set, Test set); Model Learning; Model Selection; Model Validation; Model Testing; M; Model Usage; Predictions
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
Provenance trace: Training set; Training / test split; Imputation; Feature selection; D’, D’’, …; Hyperparameters; C1, C2, C3; Model Learning; M
34. 36
Provenance of what?
Base case:
- opaque program Po
- coarse-grained dataset
Default provenance:
- Every output depends on every input
Other cases:
- Transparent program PT, coarse-grained datasets
- Transparent program PT, fine-grained datasets
- Transparent pipeline, fine-grained datasets
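The default provenance of the base case can be written down in a few lines. This is a minimal sketch, assuming inputs and outputs are identified by plain strings; the function name and the identifiers are hypothetical.

```python
from itertools import product


def default_provenance(inputs, outputs):
    """Return (output, input) derivation pairs for an opaque program.

    With no visibility into the program, the only safe assumption is
    that every output depends on every input.
    """
    return [(out, inp) for out, inp in product(outputs, inputs)]


# Two inputs and three outputs give 2 x 3 = 6 derivation pairs.
derivations = default_provenance(["d1", "d2"], ["o1", "o2", "o3"])
print(len(derivations))
```

The number of pairs grows with the product of inputs and outputs, which is why finer-grained, operator-aware capture is attractive when the program is transparent.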
42. 44
Shape change example: one-hot encoding
Regular pandas operators are “observed” by
the tracker
The tracker object should be constantly in sync
with the state of the underlying dataframe
43. 45
Joins
1. Add a second DF to the tracker
2. Specify join keys
3. Perform join
All join variants are supported, but no indexes
44. 46
Join provenance pattern -- keys
[Figure: PROV graph relating the Left, Right and Output elements to the Join activity through Used, wasGeneratedBy, wasInvalidatedBy and wasDerivedFrom relations]
45. 47
Join provenance pattern -- non-key elements
[Figure: PROV graph relating the Left, Right and Output elements to the Join activity through Used, wasGeneratedBy, wasInvalidatedBy and wasDerivedFrom relations]
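One way to read the two join patterns is sketched below, under stated assumptions: rows are plain dictionaries, the join is an inner join on a single key, non-key column names do not clash, and an output key element is taken to derive from both inputs while a non-key element derives from one side only. The function, the element identifiers and the statement tuples are hypothetical, and wasInvalidatedBy statements are left out for brevity.

```python
def join_with_provenance(left, right, key):
    """Inner-join two lists of dict rows on `key`, recording PROV-style statements."""
    statements = []
    output = []
    # Index the right-hand rows by key value, keeping their positions.
    right_by_key = {}
    for j, rrow in enumerate(right):
        right_by_key.setdefault(rrow[key], []).append((j, rrow))
    for i, lrow in enumerate(left):
        for j, rrow in right_by_key.get(lrow[key], []):
            n = len(output)
            merged = {**lrow, **rrow}
            output.append(merged)
            for col in merged:
                out_id = f"out[{n}].{col}"
                statements.append(("wasGeneratedBy", out_id, "join"))
                if col in lrow:
                    src = f"left[{i}].{col}"
                    statements.append(("used", "join", src))
                    statements.append(("wasDerivedFrom", out_id, src))
                if col in rrow:
                    src = f"right[{j}].{col}"
                    statements.append(("used", "join", src))
                    statements.append(("wasDerivedFrom", out_id, src))
    return output, statements


left = [{"k": 1, "a": "x"}]
right = [{"k": 1, "b": "y"}, {"k": 2, "b": "z"}]
output, statements = join_with_provenance(left, right, "k")
print(output)
```

In this reading the key element of the output row has two derivations, one per input, whereas each non-key element has exactly one.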
52. 54
Query performance
Results on Census provenance
Query classes:
- All Transformations: 0.001s
- Feature Operation: 0.001s
- Record Operation: 2.2s
- Item Operation: 0.47s
- Feature Invalidation: 0.004s
- Record Invalidation: 0.26s
- Item Invalidation: 0.028s
53. 56
Summary
1. Is it practical to collect fine-grained provenance?
   1. To what extent can it be done automatically?
   2. How much does it cost?
2. Is it also useful? Does it help address the key questions on high-risk AI systems?