Paolo Missier
School of Computing
Newcastle University, UK
IPAW @Provenance Week, July 2021
Quo vadis, provenancer?
Cui prodest?
our own trajectory: provenance of data science pipelines
2
Part I:
Quo vadis?
3
A little bibliometrics
Database: Web of Science Core Collection
TI = ((data or workflow) and provenance)
OR
AB = ((data or workflow) and provenance)
2000-today
>3,000 records
4
Caveat: WoS vs Scopus
A similar query returns about 12,000 records from Scopus:
TITLE-ABS-KEY ( ( data OR workflow ) AND provenance ) AND PUBYEAR > 2000
4,500 after refinement by subject area:
TITLE-ABS-KEY ( ( data OR workflow ) AND provenance ) AND PUBYEAR > 2000
AND ( LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "MATH" )
OR LIMIT-TO ( SUBJAREA , "ENGI" ) )
5
Disconnect?
big tickets:
geosciences, food science,
…. *science!
6
Most relevant sources (WoS)
7
Highly cited
940
"the pre-tectonic monzogranitic gneisses of the Liaoji granitoids or similar-aged granitoids may have been an important component of
the provenance for the Liaohe Group."
8
500
9
Focus on our own community
Database: Web of Science Core Collection
TI = ((data or workflow) and provenance)
OR
AB = ((data or workflow) and provenance)
2000-today
Query Web of Science
(could have used Scopus)
>3,000 records
Refine by WoS categories
1,500 records
Restrict to 2020-21
120 records
Refined by: WEB OF SCIENCE CATEGORIES: (COMPUTER SCIENCE THEORY METHODS OR COMPUTER SCIENCE INFORMATION SYSTEMS OR
COMPUTER SCIENCE SOFTWARE ENGINEERING OR ENGINEERING ELECTRICAL ELECTRONIC OR COMPUTER SCIENCE INTERDISCIPLINARY
APPLICATIONS OR GEOSCIENCES MULTIDISCIPLINARY OR TELECOMMUNICATIONS OR COMPUTER SCIENCE ARTIFICIAL INTELLIGENCE OR
COMPUTER SCIENCE HARDWARE ARCHITECTURE OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MEDICAL INFORMATICS)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (REMOTE SENSING OR GEOGRAPHY PHYSICAL OR IMAGING SCIENCE PHOTOGRAPHIC
TECHNOLOGY OR ARCHAEOLOGY OR ASTRONOMY ASTROPHYSICS OR BIOCHEMICAL RESEARCH METHODS OR GEOCHEMISTRY GEOPHYSICS
OR ANTHROPOLOGY OR AUTOMATION CONTROL SYSTEMS OR OPERATIONS RESEARCH MANAGEMENT SCIENCE OR COMPUTER SCIENCE
CYBERNETICS OR INFORMATION SCIENCE LIBRARY SCIENCE OR ENGINEERING BIOMEDICAL OR ENGINEERING MULTIDISCIPLINARY)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (HEALTH CARE SCIENCES SERVICES OR MATERIALS SCIENCE MULTIDISCIPLINARY OR
AGRICULTURE MULTIDISCIPLINARY OR CHEMISTRY MULTIDISCIPLINARY OR ECOLOGY OR EDUCATION SCIENTIFIC DISCIPLINES OR
ENGINEERING CHEMICAL OR ENGINEERING ENVIRONMENTAL OR ENVIRONMENTAL SCIENCES OR GREEN SUSTAINABLE SCIENCE
TECHNOLOGY OR NEUROSCIENCES OR OPTICS OR POLITICAL SCIENCE)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (ENERGY FUELS OR RADIOLOGY NUCLEAR MEDICINE MEDICAL IMAGING OR ROBOTICS)
AND [excluding]: WEB OF SCIENCE CATEGORIES: (GEOSCIENCES MULTIDISCIPLINARY OR MINERALOGY OR ENGINEERING GEOLOGICAL OR
CHEMISTRY PHYSICAL OR ENGINEERING PETROLEUM OR INSTRUMENTS INSTRUMENTATION OR EVOLUTIONARY BIOLOGY OR CHEMISTRY
ANALYTICAL OR HEALTH POLICY SERVICES OR ENGINEERING CIVIL OR HISTORY PHILOSOPHY OF SCIENCE OR LOGIC OR INTERNATIONAL
RELATIONS OR MARINE FRESHWATER BIOLOGY OR ACOUSTICS OR MULTIDISCIPLINARY SCIENCES OR MINING MINERAL PROCESSING OR
OCEANOGRAPHY OR PHYSICS APPLIED OR MUSIC OR BIOLOGY OR NANOSCIENCE NANOTECHNOLOGY OR GEOLOGY OR ENGINEERING
INDUSTRIAL OR PHARMACOLOGY PHARMACY OR PALEONTOLOGY OR ENGINEERING MANUFACTURING OR PHYSICS MULTIDISCIPLINARY OR
WATER RESOURCES OR ENGINEERING MARINE OR PUBLIC ADMINISTRATION OR MEDICAL INFORMATICS OR HUMANITIES MULTIDISCIPLINARY
OR PUBLIC ENVIRONMENTAL OCCUPATIONAL HEALTH OR SOCIAL SCIENCES MATHEMATICAL METHODS OR METEOROLOGY ATMOSPHERIC
SCIENCES OR BUSINESS OR TRANSPORTATION SCIENCE TECHNOLOGY OR SOIL SCIENCE OR CONSTRUCTION BUILDING TECHNOLOGY)
10
Back in the comfort zone
11
WordCloud – abstracts / trigrams
12
“our” TreeMap (Abstracts / bigrams)
15
Word dynamics – abstracts (bi-grams)
16
Thematic evolution – abstracts (bi-grams)
M.J. Cobo, A.G. López-Herrera, E. Herrera-Viedma, F. Herrera. An approach for detecting, quantifying, and visualizing the evolution of a research field: A practical
application to the Fuzzy Sets Theory field. Journal of Informetrics 5(1), 2011. https://doi.org/10.1016/j.joi.2010.10.002
17
Thematic evolution – keywords (Scopus)
18
Thematic evolution – keywords (Scopus)
19
Trending topics – abstracts (bi-grams)
20
Trending topics – keywords
21
Trending topics – abstracts 2018-2021
22
Trending topics – titles 2018-2021
23
Trending topics – Scopus
24
Co-occurrence networks
25
Blockchain and provenance – recent papers
Ruan, P., Dinh, T.T.A., Lin, Q. et al. LineageChain: a fine-grained, secure and efficient data provenance system for
blockchains. The VLDB Journal 30, 3–24 (2021). https://doi.org/10.1007/s00778-020-00646-1
we identify and motivate a new class of smart contracts that rely on
provenance information at runtime.
LineageChain exposes lineage information to smart contracts runtime via
interfaces that support provenance-dependent
contracts. LineageChain captures provenance during contract execution […]
Pinna, Andrea, Tonelli, Roberto, Marchesi, Michele, Ibba, Simona, and Baralla, Gavina. "Ensuring Transparency and
Traceability of Food Local Products: A Blockchain Application to a Smart Tourism Region." Concurrency and
Computation: Practice and Experience 33.1 (2021).
Casey, Eoghan, Bourquenoud, Jonathan, and Jaquet-Chiffelle, David-Olivier. "Tamperproof Timestamped Provenance
Ledger Using Blockchain Technology." Forensic Science International: Digital Investigation 33 (2020): 300977. Web.
A. Musamih et al., "A Blockchain-Based Approach for Drug Traceability in Healthcare Supply Chain," in IEEE Access, vol.
9, pp. 9728-9743, 2021, doi: 10.1109/ACCESS.2021.3049920.
Bai, B, Nazir, S, Bai, Y, Anees, A. Security and provenance for Internet of Health Things: A systematic literature review. J
Softw Evol Proc. 2021; 33:e2335. https://doi.org/10.1002/smr.2335
Porkodi, S., Kesavaraja, D. Secure Data Provenance in Internet of Things using Hybrid Attribute based Crypt
Technique. Wireless Pers Commun 118, 2821–2842 (2021). https://doi.org/10.1007/s11277-021-08157-0
26
Collaborations
27
Part II:
Cui prodest?
PROV submitted as a case for “impactful research” to UK REF 2021
28
PROV @ UK REF 2021: NASA
NASA/ USGCRP (US Global Change Research Program)
US Global Change Information System (GCIS) https://data.globalchange.gov/. (Tilmes, Sherman)
PROV is used in the GCIS to enforce the traceability of all of the about 50,000 individual resources held
in the database […]
- Changes in working practice & policy.
- Effect on policy debate provided by transparency and assurance
29
PROV @ UK REF 2021: UK National Archives
UK National Archives (Cresswell)
- Change of working practice as a result of the requirement by the National Archives (NA) to include
provenance. All of Gazette data must now be supported by provenance statements.
- Traceability of legislation data
30
PROV @ UK REF 2021: AstraZeneca
AstraZeneca (Plasterer)
The process of adopting PROV along with other ontologies started in 2013 as part of a million-dollar
project, where PROV is estimated to account for about 5-10%, with continued maintenance to date.
- change of working practices, where the use of shared vocabularies now informs data governance
and promotes transparency
- competitive advantage […] CI360 technology is based on “nanopublications” (http://nanopub.org/)
31
PROV @ UK REF 2021: others
https://blogs.ncl.ac.uk/paolomissier/2021/02/07/w3c-prov-some-interesting-extensions-to-the-core-standard/
32
Part III:
Data Provenance for Data Science (DP4DS)
In collaboration with:
Prof. Torlone, Giulia Simonelli, Luca Lauro – Università Roma Tre, Italy
Prof. Chapman -- University of Southampton, UK
Adriane Chapman, Paolo Missier, Giulia Simonelli, and Riccardo Torlone. 2020. Capturing and querying fine-grained
provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520.
DOI:https://doi.org/10.14778/3436905.3436911
33
Traceability, explainability, transparency – EU regulations
“Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing!
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events
(‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or
common specifications.
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
“AI systems that create a high risk to the health and safety or fundamental rights of natural persons […] the
classification as high-risk does not only depend on the function performed by the AI system, but also on the specific
purpose and modalities for which that system is used.”
- used for the purpose of assessing students
- recruitment or selection of natural persons
- evaluate the eligibility of natural persons for public assistance benefits and services
- evaluate the creditworthiness of natural persons or establish their credit score
- used by law enforcement authorities for making individual risk assessments
34
Can provenance help address the new EU regulations?
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
Article 12 Record-keeping
2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that
is appropriate to the intended purpose of the system.
3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect
to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or
lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61.
4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a
minimum:
(a) recording of the period of each use of the system (start date and time and end date and time of each use);
(b) the reference database against which input data has been checked by the system;
(c) the input data for which the search has led to a match;
(d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
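As a rough illustration of how the minimum logging fields (a) to (d) quoted above might be held in a single record, here is a minimal Python sketch. The class name `UsageLogRecord`, its field names, and the sample identifiers are all hypothetical; the proposed regulation does not prescribe a schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class UsageLogRecord:
    """Hypothetical record holding the minimum fields of Article 12(4)."""
    start: datetime                 # (a) start date and time of the use
    end: datetime                   # (a) end date and time of the use
    reference_database: str         # (b) database the input was checked against
    matched_inputs: List[str] = field(default_factory=list)  # (c) inputs that led to a match
    verifiers: List[str] = field(default_factory=list)       # (d) persons who verified the results

    def to_dict(self) -> dict:
        """Serialise to plain types so the record can be written to a log."""
        d = asdict(self)
        d["start"] = self.start.isoformat()
        d["end"] = self.end.isoformat()
        return d


record = UsageLogRecord(
    start=datetime(2021, 7, 19, 9, 0, tzinfo=timezone.utc),
    end=datetime(2021, 7, 19, 9, 5, tzinfo=timezone.utc),
    reference_database="reference-db-v1",    # made-up identifier
    matched_inputs=["input-42"],             # made-up identifier
    verifiers=["reviewer-1", "reviewer-2"],  # made-up identifiers
)
log_entry = record.to_dict()
```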
35
Pipeline structure with provenance annotations
[Figure: data science pipeline. Labels: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split; Training set; Test set; Model Selection; Model Learning; Model Validation; Model Testing; M; Model Usage; Predictions]
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Figure: provenance trace. Labels: Training / test split; Imputation; Feature selection; D’, D’’, …; Training set; Hyperparameters; Model Learning; M; C1, C2, C3]
36
Provenance of what?
- Transparent pipeline
- Fine-grained datasets
- Transparent program PT
- Fine-grained datasets
Base case:
- opaque program Po
- coarse-grained dataset
Default provenance:
- Every output depends on every input
- Transparent program PT
- coarse-grained datasets
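The default rule for the base case (opaque program, coarse-grained datasets) can be sketched in a few lines of Python. The function name and the triple representation are assumptions made for illustration only; they are not taken from the talk's tooling.

```python
from itertools import product
from typing import List, Tuple


def default_provenance(inputs: List[str], outputs: List[str]) -> List[Tuple[str, str, str]]:
    """Coarse default for an opaque program: every output is derived from every input."""
    return [(out, "wasDerivedFrom", inp) for out, inp in product(outputs, inputs)]


# One output dataset and two input datasets give two derivation edges.
edges = default_provenance(inputs=["D1", "D2"], outputs=["D_out"])
```

With m inputs and n outputs this yields m × n edges, which is why finer-grained capture needs a transparent program or fine-grained datasets.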
37
Typical operators used in data prep
38
Operators
op
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (eg one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation → adding columns
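To make the three operator categories concrete, the following sketch shows one or two plain pandas calls per category. The toy dataframe and its column names are invented for illustration; the talk does not prescribe these particular calls.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40],
    "city": ["Rome", "Newcastle", "Rome"],
    "income": [30.0, 50.0, 70.0],
})

# Data reduction: feature selection (drop a column) and instance selection (filter rows)
reduced = df.drop(columns=["income"])
selected = df[df["income"] > 40.0]

# Data augmentation: one-hot encoding replaces one column with several (vertical augmentation)
encoded = pd.get_dummies(df, columns=["city"])

# Data transformation: imputation, then min-max normalisation of one column
imputed = df.fillna({"age": df["age"].mean()})
normalised = imputed.assign(
    income=(imputed["income"] - imputed["income"].min())
    / (imputed["income"].max() - imputed["income"].min())
)
```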
39
Provenance patterns for each operator
40
Provenance templates
Template + binding rules = instantiated provenance fragment
+
op
{old values: F, I, V} → {new values: F’, J, V’}
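The idea "template + binding rules = instantiated provenance fragment" can be sketched as simple variable substitution. The template shape, the `var:` prefix, and the cell identifiers below are assumptions chosen for illustration; only the PROV relation names come from the slides.

```python
from typing import Dict, List, Tuple

Statement = Tuple[str, str, str]

# A template is a list of statements whose terms may be variables ("var:...").
TEMPLATE: List[Statement] = [
    ("var:new", "wasGeneratedBy", "var:op"),
    ("var:op", "used", "var:old"),
    ("var:new", "wasDerivedFrom", "var:old"),
    ("var:old", "wasInvalidatedBy", "var:op"),
]


def instantiate(template: List[Statement], bindings_list: List[Dict[str, str]]) -> List[Statement]:
    """Apply each set of bindings to the template and collect the resulting statements."""
    fragment = []
    for bindings in bindings_list:
        for subj, rel, obj in template:
            fragment.append((bindings.get(subj, subj), rel, bindings.get(obj, obj)))
    return fragment


# Hypothetical bindings for an operator that rewrites two cells of column D.
bindings = [
    {"var:op": "op1", "var:old": "cell(r0,D)", "var:new": "cell(r0,D)'"},
    {"var:op": "op1", "var:old": "cell(r1,D)", "var:new": "cell(r1,D)'"},
]
fragment = instantiate(TEMPLATE, bindings)
```

Each binding set produces one copy of the template, so the fragment grows linearly with the number of data elements the operator touches.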
41
This applies to all operators…
42
Making your code provenance-aware
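import pandas as pd
# pr and ProvenanceTracker are the modules of the provenance library presented in the talk;
# their import statements are not shown on the slide
df = pd.DataFrame(…)
# Create a new provenance document
p = pr.Provenance(df, savepath)
# Create the provenance tracker
tracker = ProvenanceTracker.ProvenanceTracker(df, p)
# Instance generation (DataFrame.append was removed in pandas 2.0; use pd.concat there)
tracker.df = tracker.df.append({'key2': 'K4'}, ignore_index=True)
# Imputation
tracker.df = tracker.df.fillna('imputed')
# Feature transformation of column D
tracker.df['D'] = tracker.df['D']*2
# Feature transformation of column key2
tracker.df['key2'] = tracker.df['key2']*2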
Idea:
A Python tracker object intercepts dataframe
operations.
Operations that are channelled through the tracker
generate provenance fragments.
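One way such interception could work is a property setter on the wrapped dataframe. The class below is a hypothetical stand-in written for illustration, not the talk's `ProvenanceTracker`; it only records shapes instead of emitting provenance.

```python
import pandas as pd


class SketchTracker:
    """Hypothetical stand-in for a tracker: wraps a dataframe and logs each reassignment."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        self.log = []  # one entry per observed operation

    @property
    def df(self) -> pd.DataFrame:
        return self._df

    @df.setter
    def df(self, new_df: pd.DataFrame) -> None:
        # Called on `tracker.df = ...`: record the shapes, then replace the dataframe.
        self.log.append({"before": self._df.shape, "after": new_df.shape})
        self._df = new_df


tracker = SketchTracker(pd.DataFrame({"D": [1.0, None], "key2": ["K1", "K2"]}))
tracker.df = tracker.df.fillna(0.0)                     # imputation: same shape
tracker.df = tracker.df.assign(E=tracker.df["D"] * 2)   # new column: shape changes
```

Note that an in-place column assignment such as `tracker.df['D'] = ...` does not pass through the setter in this sketch, so a real tracker needs an additional mechanism for that case.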
43
Semi-automated operator detection
Dataframe shape change → {add, remove} {columns, rows}
Data value change → single cell → {columns, rows}
Pandas Dataframe tracker
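The two kinds of change (shape and value) could be detected by comparing the dataframe before and after an operation. The function below is a sketch under that assumption, with a hypothetical name and report format; it is not the detection algorithm used in the talk.

```python
import pandas as pd


def detect_change(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Classify the difference between two dataframe states."""
    common_cols = [c for c in before.columns if c in after.columns]
    common_rows = [i for i in before.index if i in after.index]

    # Value changes are only checked on cells present in both states.
    changed_cells = []
    for i in common_rows:
        for c in common_cols:
            old, new = before.at[i, c], after.at[i, c]
            if pd.isna(old) and pd.isna(new):
                continue  # both missing: treat as unchanged
            if pd.isna(old) or pd.isna(new) or old != new:
                changed_cells.append((i, c))

    return {
        "added_columns": [c for c in after.columns if c not in before.columns],
        "removed_columns": [c for c in before.columns if c not in after.columns],
        "added_rows": [i for i in after.index if i not in before.index],
        "removed_rows": [i for i in before.index if i not in after.index],
        "changed_cells": changed_cells,
    }


before = pd.DataFrame({"A": [1.0, None], "B": ["x", "y"]})
imputed = before.fillna({"A": 0.0})      # value change in a single cell
reduced = imputed.drop(columns=["B"])    # shape change: one column removed

value_report = detect_change(before, imputed)
shape_report = detect_change(imputed, reduced)
```

The cell-by-cell loop is quadratic in the dataframe size and is kept only for readability.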
44
Shape change example: one-hot encoding
Regular pandas operators are “observed” by
the tracker
The tracker object should be constantly in sync
with the state of the underlying dataframe
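For one-hot encoding specifically, the shape change is one removed column and several added ones. The sketch below uses `pd.get_dummies` on an invented dataframe and derives column-level dependencies from the difference; the triple format is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "colour": ["red", "green", "red"]})
encoded = pd.get_dummies(df, columns=["colour"])

# Shape change observed by comparing column sets before and after.
removed = [c for c in df.columns if c not in encoded.columns]
added = [c for c in encoded.columns if c not in df.columns]

# Column-level derivations: each new indicator column is derived from the removed column.
derivations = [(new, "wasDerivedFrom", old) for old in removed for new in added]
```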
45
Joins
1. Add a second DF to the tracker
2. Specify join keys
3. Perform join
All join variants are supported, but joins on indexes are not
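The three steps can be mimicked with plain pandas to show what row-level lineage a join needs to retain. This sketch does not use the talk's tracker API; the helper columns `left_row` and `right_row` are hypothetical names introduced here.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K1", "K2", "K3"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K3", "K4"], "B": [10, 30, 40]})

# Materialise row identifiers as ordinary columns so they survive the merge.
left_ids = left.reset_index().rename(columns={"index": "left_row"})
right_ids = right.reset_index().rename(columns={"index": "right_row"})

# Join on an explicit key column (not on the index).
joined = left_ids.merge(right_ids, on="key", how="inner")

# For each output row, record which left and right rows it came from.
lineage = [
    (out_row, int(l), int(r))
    for out_row, (l, r) in enumerate(zip(joined["left_row"], joined["right_row"]))
]
```

For outer joins the unmatched side would hold missing values in its row-identifier column, so the `int(...)` conversion above would need a guard.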
46
Join provenance pattern -- keys
Join
activity
wasGeneratedBy
wasInvalidatedBy
Used
Left Right Output
wasInvalidatedBy
Used
wasDerivedFrom
47
Join provenance pattern -- non-key elements
Join
activity
wasGeneratedBy
wasInvalidatedBy
Used
Left Right Output
wasDerivedFrom
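The exact edges of the two join patterns cannot be recovered from the figure labels alone, so the following is only one plausible reading: a key element of the output has a source on both sides, while a non-key element has a single source. The function name and element identifiers are hypothetical.

```python
from typing import List, Tuple


def join_pattern(activity: str, sources: List[str], output: str) -> List[Tuple[str, str, str]]:
    """Emit PROV-style statements linking one output element to its source elements."""
    statements = [(output, "wasGeneratedBy", activity)]
    for src in sources:
        statements.append((activity, "used", src))
        statements.append((output, "wasDerivedFrom", src))
        statements.append((src, "wasInvalidatedBy", activity))
    return statements


# Key element: derived from the matching key on both sides.
key_fragment = join_pattern("join1", ["left.key[0]", "right.key[0]"], "out.key[0]")
# Non-key element: derived from one side only.
nonkey_fragment = join_pattern("join1", ["left.A[0]"], "out.A[0]")
```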
48
Putting it all together
49
Performance
Capture: Multiprocessing
- writing operator provenance to disk
- scanning the dataframe
Storage: Compression
Benchmark Queries
1 process / core
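To give a feel for why compression helps with stored provenance, here is a small sketch using gzip on JSON records. The talk does not say which compression scheme or storage format was used, so both gzip and the record layout are assumptions.

```python
import gzip
import json

# Fine-grained provenance is highly repetitive: many records share the same structure.
records = [{"entity": f"cell(r{i},D)", "wasGeneratedBy": "op1"} for i in range(1000)]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

# Round trip to check that nothing is lost.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
ratio = len(compressed) / len(raw)
```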
50
Multiprocessing – disk writing
One-hot encoding with dataframe sizes:
1. 260K
2. 521K
3. 1.3M
About 70% improvement
51
Multiprocessing – dataframe scanning
Improvement depends on type of operator
About 60% improvement
52
Storage compression
Census dataset:
53
Evaluation - performance
54
Query performance
Results on Census provenance
Query classes:
- All Transformations: 0.001s
- Feature Operation: 0.001s
- Record Operation: 2.2s
- Item Operation: 0.47s
- Feature Invalidation: 0.004s
- Record Invalidation: 0.26s
- Item Invalidation: 0.028s
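The query classes are named but not defined on the slide. The sketch below shows one possible reading of a few of them over a flattened list of provenance records; the record layout and function names are hypothetical and are not the benchmark queries themselves.

```python
from typing import Dict, List

# Hypothetical flattened provenance: one record per data element an operation touched.
PROV: List[Dict] = [
    {"op": "impute", "row": 0, "col": "age", "invalidated": True},
    {"op": "impute", "row": 1, "col": "age", "invalidated": False},
    {"op": "drop_col", "row": 0, "col": "income", "invalidated": True},
    {"op": "drop_col", "row": 1, "col": "income", "invalidated": True},
]


def _distinct(ops):
    """Keep the first occurrence of each operation, preserving order."""
    return list(dict.fromkeys(ops))


def all_transformations(prov):
    return _distinct(r["op"] for r in prov)


def feature_operations(prov, col):
    return _distinct(r["op"] for r in prov if r["col"] == col)


def record_operations(prov, row):
    return _distinct(r["op"] for r in prov if r["row"] == row)


def item_invalidation(prov, row, col):
    return _distinct(
        r["op"] for r in prov
        if r["row"] == row and r["col"] == col and r["invalidated"]
    )
```

Under this reading, record-level and item-level queries scan per-element records, which is consistent with them being slower than the operation-level queries in the timings above.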
56
Summary
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? → Does it help address the key questions on high-risk AI systems?
57

 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Provenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-ComputationProvenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-ComputationPaolo Missier
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Paolo Missier
 

Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data science pipelines

  • 1. Paolo Missier School of Computing Newcastle University, UK IPAW @Provenance Week, July 2021 Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data science pipelines
  • 3. A little bibliometrics. Database: Web of Science Core Collection. Query: TI = ((data or workflow) and provenance) OR AB = ((data or workflow) and provenance), 2000-today. Result: >3,000 records.
  • 4. Caveat: WoS vs Scopus. A similar query returns about 12,000 records from Scopus: TITLE-ABS-KEY ( ( data OR workflow ) AND provenance ) AND PUBYEAR > 2000. This drops to 4,500 after refinement by subject area: TITLE-ABS-KEY ( ( data OR workflow ) AND provenance ) AND PUBYEAR > 2000 AND ( LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "MATH" ) OR LIMIT-TO ( SUBJAREA , "ENGI" ) )
  • 7. Highly cited: 940. "the pre-tectonic monzogranitic gneisses of the Liaoji granitoids or similar-aged granitoids may have been an important component of the provenance for the Liaohe Group."
  • 9. Focus on our own community. Database: Web of Science Core Collection. Query: TI = ((data or workflow) and provenance) OR AB = ((data or workflow) and provenance), 2000-today. Steps: query Web of Science (could have used Scopus): >3,000 records; refine by WoS categories: 1,500 records; restrict to 2020-21: 120 records. Refined by: WEB OF SCIENCE CATEGORIES: (COMPUTER SCIENCE THEORY METHODS OR COMPUTER SCIENCE INFORMATION SYSTEMS OR COMPUTER SCIENCE SOFTWARE ENGINEERING OR ENGINEERING ELECTRICAL ELECTRONIC OR COMPUTER SCIENCE INTERDISCIPLINARY APPLICATIONS OR GEOSCIENCES MULTIDISCIPLINARY OR TELECOMMUNICATIONS OR COMPUTER SCIENCE ARTIFICIAL INTELLIGENCE OR COMPUTER SCIENCE HARDWARE ARCHITECTURE OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MEDICAL INFORMATICS) AND [excluding]: WEB OF SCIENCE CATEGORIES: (REMOTE SENSING OR GEOGRAPHY PHYSICAL OR IMAGING SCIENCE PHOTOGRAPHIC TECHNOLOGY OR ARCHAEOLOGY OR ASTRONOMY ASTROPHYSICS OR BIOCHEMICAL RESEARCH METHODS OR GEOCHEMISTRY GEOPHYSICS OR ANTHROPOLOGY OR AUTOMATION CONTROL SYSTEMS OR OPERATIONS RESEARCH MANAGEMENT SCIENCE OR COMPUTER SCIENCE CYBERNETICS OR INFORMATION SCIENCE LIBRARY SCIENCE OR ENGINEERING BIOMEDICAL OR ENGINEERING MULTIDISCIPLINARY) AND [excluding]: WEB OF SCIENCE CATEGORIES: (HEALTH CARE SCIENCES SERVICES OR MATERIALS SCIENCE MULTIDISCIPLINARY OR AGRICULTURE MULTIDISCIPLINARY OR CHEMISTRY MULTIDISCIPLINARY OR ECOLOGY OR EDUCATION SCIENTIFIC DISCIPLINES OR ENGINEERING CHEMICAL OR ENGINEERING ENVIRONMENTAL OR ENVIRONMENTAL SCIENCES OR GREEN SUSTAINABLE SCIENCE TECHNOLOGY OR NEUROSCIENCES OR OPTICS OR POLITICAL SCIENCE) AND [excluding]: WEB OF SCIENCE CATEGORIES: (ENERGY FUELS OR RADIOLOGY NUCLEAR MEDICINE MEDICAL IMAGING OR ROBOTICS) AND [excluding]: WEB OF SCIENCE CATEGORIES: (GEOSCIENCES MULTIDISCIPLINARY OR MINERALOGY OR ENGINEERING GEOLOGICAL OR CHEMISTRY PHYSICAL OR ENGINEERING PETROLEUM OR INSTRUMENTS INSTRUMENTATION OR EVOLUTIONARY BIOLOGY OR CHEMISTRY ANALYTICAL OR HEALTH POLICY SERVICES OR ENGINEERING CIVIL OR HISTORY PHILOSOPHY OF SCIENCE OR LOGIC OR INTERNATIONAL RELATIONS OR MARINE FRESHWATER BIOLOGY OR ACOUSTICS OR MULTIDISCIPLINARY SCIENCES OR MINING MINERAL PROCESSING OR OCEANOGRAPHY OR PHYSICS APPLIED OR MUSIC OR BIOLOGY OR NANOSCIENCE NANOTECHNOLOGY OR GEOLOGY OR ENGINEERING INDUSTRIAL OR PHARMACOLOGY PHARMACY OR PALEONTOLOGY OR ENGINEERING MANUFACTURING OR PHYSICS MULTIDISCIPLINARY OR WATER RESOURCES OR ENGINEERING MARINE OR PUBLIC ADMINISTRATION OR MEDICAL INFORMATICS OR HUMANITIES MULTIDISCIPLINARY OR PUBLIC ENVIRONMENTAL OCCUPATIONAL HEALTH OR SOCIAL SCIENCES MATHEMATICAL METHODS OR METEOROLOGY ATMOSPHERIC SCIENCES OR BUSINESS OR TRANSPORTATION SCIENCE TECHNOLOGY OR SOIL SCIENCE OR CONSTRUCTION BUILDING TECHNOLOGY)
  • 10. Back in the comfort zone
  • 13. Word dynamics: abstracts (bi-grams)
  • 14. Thematic evolution: abstracts (bi-grams). M.J. Cobo, A.G. López-Herrera, E. Herrera-Viedma, F. Herrera, An approach for detecting, quantifying, and visualizing the evolution of a research field: A practical application to the Fuzzy Sets Theory field, Journal of Informetrics, 5(1), 2011, https://doi.org/10.1016/j.joi.2010.10.002.
  • 15. Thematic evolution: keywords (Scopus)
  • 16. Thematic evolution: keywords (Scopus)
  • 17. Trending topics: abstracts (bi-grams)
  • 19. Trending topics: abstracts 2018-2021
  • 20. Trending topics: titles 2018-2021
  • 23. Blockchain and provenance: recent papers
      Ruan, P., Dinh, T.T.A., Lin, Q. et al. LineageChain: a fine-grained, secure and efficient data provenance system for blockchains. The VLDB Journal 30, 3–24 (2021). https://doi.org/10.1007/s00778-020-00646-1. Excerpt: "we identify and motivate a new class of smart contracts that rely on provenance information at runtime. LineageChain exposes lineage information to smart contracts runtime via interfaces that support provenance-dependent contracts. LineageChain captures provenance during contract execution […]"
      Pinna, Andrea, Tonelli, Roberto, Marchesi, Michele, Ibba, Simona, and Baralla, Gavina. "Ensuring Transparency and Traceability of Food Local Products: A Blockchain Application to a Smart Tourism Region." Concurrency and Computation: Practice and Experience 33(1) (2021).
      Casey, Eoghan, Bourquenoud, Jonathan, and Jaquet-Chiffelle, David-Olivier. "Tamperproof Timestamped Provenance Ledger Using Blockchain Technology." Forensic Science International: Digital Investigation 33 (2020): 300977.
      A. Musamih et al., "A Blockchain-Based Approach for Drug Traceability in Healthcare Supply Chain," in IEEE Access, vol. 9, pp. 9728-9743, 2021, doi: 10.1109/ACCESS.2021.3049920.
      Bai, B, Nazir, S, Bai, Y, Anees, A. Security and provenance for Internet of Health Things: A systematic literature review. J Softw Evol Proc. 2021; 33:e2335. https://doi.org/10.1002/smr.2335
      Porkodi, S., Kesavaraja, D. Secure Data Provenance in Internet of Things using Hybrid Attribute based Crypt Technique. Wireless Pers Commun 118, 2821–2842 (2021). https://doi.org/10.1007/s11277-021-08157-0
  • 25. Part II: Cui prodest? PROV submitted as a case for “impactful research” to UK REF 2021
  • 26. PROV @ UK REF 2021: NASA. NASA / USGCRP (US Global Change Research Program), US Global Change Information System (GCIS), https://data.globalchange.gov/ (Tilmes, Sherman). PROV is used in the GCIS to enforce the traceability of all of the roughly 50,000 individual resources held in the database […] Impact: changes in working practice and policy; effect on policy debate provided by transparency and assurance.
  • 27. PROV @ UK REF 2021: UK National Archives (Cresswell). Impact: change of working practice as a result of the requirement by the National Archives (NA) to include provenance, so that all Gazette data must now be supported by provenance statements; traceability of legislation data.
  • 28. PROV @ UK REF 2021: AstraZeneca (Plasterer). The process of adopting PROV along with other ontologies started in 2013 as part of a million-dollar project, where PROV is estimated to account for about 5-10%, with continued maintenance to date. Impact: change of working practices, where the use of shared vocabularies now informs data governance and promotes transparency; competitive advantage […] CI360 technology is based on “nanopublications” (http://nanopub.org/)
  • 29. PROV @ UK REF 2021: others. https://blogs.ncl.ac.uk/paolomissier/2021/02/07/w3c-prov-some-interesting-extensions-to-the-core-standard/
  • 30. Part III: Data Provenance for Data Science (DP4DS). In collaboration with: Prof. Torlone, Giulia Simonelli, Luca Lauro (Università Roma Tre, Italy); Prof. Chapman (University of Southampton, UK). Adriane Chapman, Paolo Missier, Giulia Simonelli, and Riccardo Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. DOI: https://doi.org/10.14778/3436905.3436911
  • 31. Traceability, explainability, transparency: EU regulations. “Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing! Article 12, Record-keeping: “1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or common specifications.” Source: Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), Brussels, 21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090. On high-risk systems: “AI systems that create a high risk to the health and safety or fundamental rights of natural persons […] the classification as high-risk does not only depend on the function performed by the AI system, but also on the specific purpose and modalities for which that system is used.” Examples of high-risk uses: assessing students; recruitment or selection of natural persons; evaluating the eligibility of natural persons for public assistance benefits and services; evaluating the creditworthiness of natural persons or establishing their credit score; use by law enforcement authorities for making individual risk assessments.
  • 32. Can provenance help address the new EU regulations? Source: Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), Brussels, 21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090. Article 12, Record-keeping: “2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that is appropriate to the intended purpose of the system. 3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61. 4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a minimum: (a) recording of the period of each use of the system (start date and time and end date and time of each use); (b) the reference database against which input data has been checked by the system; (c) the input data for which the search has led to a match; (d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).”
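The minimum logging content in Article 12(4), points (a) to (d), can be pictured as a simple record type. The following is only an illustrative sketch: the class name `UsageLogRecord`, its field names and the example values are all hypothetical and do not come from the regulation or from the talk.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List
import json


@dataclass
class UsageLogRecord:
    """Hypothetical record mirroring Article 12(4), points (a)-(d)."""
    start: datetime                 # (a) start of the period of use
    end: datetime                   # (a) end of the period of use
    reference_database: str         # (b) database the input was checked against
    matched_inputs: List[str] = field(default_factory=list)  # (c) inputs that led to a match
    verified_by: List[str] = field(default_factory=list)     # (d) persons verifying the results

    def to_json(self) -> str:
        """Serialise the record, writing timestamps in ISO 8601."""
        data = asdict(self)
        data["start"] = self.start.isoformat()
        data["end"] = self.end.isoformat()
        return json.dumps(data, sort_keys=True)


record = UsageLogRecord(
    start=datetime(2021, 7, 19, 9, 0, tzinfo=timezone.utc),
    end=datetime(2021, 7, 19, 9, 5, tzinfo=timezone.utc),
    reference_database="example-reference-db",   # placeholder name
    matched_inputs=["input-001"],
    verified_by=["reviewer-A"],
)
print(record.to_json())
```

A provenance document would typically carry more than this (which activity produced which entity), but the record shows how little structure the minimum requirement asks for.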
  • 33. [Figure: pipeline structure with provenance annotations. Stages shown: data sources; acquisition, wrangling; preparing for learning; training / test split (training set, test set); model selection; model learning; model validation; model testing; model usage (predictions).] Decision points listed on the slide: (a) source selection, sample / population shape, cleaning, integration; (b) sampling / stratification, feature selection, feature engineering, dimensionality reduction, regularisation, imputation, class rebalancing, … [Figure: provenance trace linking training / test split, imputation and feature selection through intermediate datasets D’, D’’, … to model learning on the training set with hyperparameters, producing model M; labels C1, C2, C3.]
  • 34. Provenance of what? Base case: opaque program Po, coarse-grained dataset. Default provenance: every output depends on every input. Other cases: transparent program PT with coarse-grained datasets; transparent program PT with fine-grained datasets; transparent pipeline with fine-grained datasets.
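To make the contrast concrete, the sketch below compares the default assumption (every output depends on every input) with what can be recorded when the program is known to act element by element. The function names and the `<dataset>/<column>/<row>` identifiers are hypothetical, chosen only for this illustration.

```python
from itertools import product


def default_provenance(inputs, outputs):
    """Opaque program: assume every output depends on every input."""
    return {(out, inp) for out, inp in product(outputs, inputs)}


def cellwise_provenance(inputs, outputs):
    """Transparent, element-wise program: output i depends only on input i."""
    return {(out, inp) for out, inp in zip(outputs, inputs)}


# Identifiers for the cells of one column before and after a transformation
# (hypothetical naming: <dataset>/<column>/<row>).
before = [f"D/age/{row}" for row in range(3)]
after = [f"D'/age/{row}" for row in range(3)]

coarse = default_provenance(before, after)
fine = cellwise_provenance(before, after)

print(len(coarse), "derivations under the default assumption")
print(len(fine), "derivations when the operator is known to be element-wise")
```

The fine-grained set is always contained in the default one; the default is safe but grows with the product of input and output sizes.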
  • 36. Operators. Data reduction: feature selection, instance selection. Data augmentation: space transformation, instance generation, encoding (e.g. one-hot…). Data transformation: data repair, binarisation, normalisation, discretisation, imputation. Ex.: vertical augmentation → adding columns
  • 37. Provenance patterns for each operator
  • 38. 40 Provenance templates. Template + binding rules = instantiated provenance fragment. op: {old values: F, I, V} → {new values: F’, J, V’}
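The "template + binding rules" idea can be illustrated with a minimal sketch. The `var:` prefix for template variables, the choice of relations in `TEMPLATE`, and the identifiers in `bindings` are all assumptions for the example; the actual templates used in the talk are not shown in this transcript.

```python
# A template is a list of PROV-style statements whose arguments may be variables.
TEMPLATE = [
    ("used", "var:op", "var:old"),
    ("wasGeneratedBy", "var:new", "var:op"),
    ("wasDerivedFrom", "var:new", "var:old"),
    ("wasInvalidatedBy", "var:old", "var:op"),
]


def instantiate(template, bindings):
    """Replace every variable by its bound value; an unbound variable raises KeyError."""
    def resolve(term):
        return bindings[term] if term.startswith("var:") else term

    return [(relation, resolve(a), resolve(b)) for relation, a, b in template]


# Hypothetical binding for one imputed cell: old value invalidated, new value generated.
bindings = {
    "var:op": "imputation_1",
    "var:old": "cell_D_3_v0",
    "var:new": "cell_D_3_v1",
}
fragment = instantiate(TEMPLATE, bindings)
for statement in fragment:
    print(statement)
```

Keeping the template fixed per operator means that only the bindings have to be computed at run time, one set per affected data element.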
  • 39. 41 This applies to all operators…
  • 40. 42 Making your code provenance-aware
      import pandas as pd
      # pr and ProvenanceTracker come from the provenance library (imports not shown on the slide)
      df = pd.DataFrame(…)
      # Create a new provenance document
      p = pr.Provenance(df, savepath)
      # Create the provenance tracker
      tracker = ProvenanceTracker.ProvenanceTracker(df, p)
      # Instance generation
      tracker.df = tracker.df.append({'key2': 'K4'}, ignore_index=True)
      # Imputation
      tracker.df = tracker.df.fillna('imputato')
      # Feature transformation of column D
      tracker.df['D'] = tracker.df['D']*2
      # Feature transformation of column key2
      tracker.df['key2'] = tracker.df['key2']*2
    Idea: a Python tracker object intercepts dataframe operations. Operations that are channeled through the tracker generate provenance fragments.
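How an object can "intercept" dataframe operations may be easier to see in a toy version. The sketch below is not the tracker from the talk: `MiniTracker` is a hypothetical class that uses a Python property so that every reassignment of `tracker.df` is recorded as a small fragment (here just the shapes before and after).

```python
import pandas as pd


class MiniTracker:
    """Hypothetical stand-in for a provenance tracker; records one fragment per reassignment."""

    def __init__(self, df):
        self._df = df.copy()
        self.fragments = []

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        # Called on every `tracker.df = ...`; a real tracker would emit provenance here.
        self.fragments.append(
            {"shape_before": self._df.shape, "shape_after": new_df.shape}
        )
        self._df = new_df.copy()


tracker = MiniTracker(pd.DataFrame({"key2": ["K1", None], "D": [1, 2]}))
tracker.df = tracker.df.fillna("missing")               # imputation
tracker.df = tracker.df.assign(D=tracker.df["D"] * 2)   # feature transformation
print(tracker.fragments)
```

One limitation of this sketch: in-place column assignment such as `tracker.df['D'] = ...` mutates the frame returned by the getter and never reaches the setter, so it would go unrecorded here. Catching that case needs a different mechanism, for example comparing snapshots of the dataframe.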
  • 41. 43 Semi-automated operator detection. Dataframe shape change → {add, remove} {columns, rows}. Data value change → single cell → {columns, rows}. [Figure: Pandas Dataframe tracker]
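The two kinds of change listed above (shape change and value change) can be detected by comparing the dataframe before and after an operation. The following is a minimal sketch under the assumption that row and column labels are unique; `detect_change` and the report keys are hypothetical names, not part of the tool presented in the talk.

```python
import pandas as pd


def detect_change(before, after):
    """Classify the difference between two dataframe states (illustrative only)."""
    report = {
        "added_columns": [c for c in after.columns if c not in before.columns],
        "removed_columns": [c for c in before.columns if c not in after.columns],
        "added_rows": [i for i in after.index if i not in before.index],
        "removed_rows": [i for i in before.index if i not in after.index],
    }
    # Value changes are only meaningful on cells present in both states.
    rows = before.index.intersection(after.index)
    cols = [c for c in before.columns if c in after.columns]
    old, new = before.loc[rows, cols], after.loc[rows, cols]
    # NaN != NaN is True, so cells that are missing in both states are excluded.
    differs = (old != new) & ~(old.isna() & new.isna())
    report["changed_cells"] = [(r, c) for r in rows for c in cols if differs.at[r, c]]
    return report


before = pd.DataFrame({"A": [1.0, None], "B": ["x", "y"]})
after = before.fillna(0.0)  # imputation: a pure value change
print(detect_change(before, after))
```

A cell-by-cell comparison like this scans the whole dataframe, which is why the cost of scanning shows up as a performance concern for provenance capture.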
  • 42. 44 Shape change example: one-hot encoding. Regular pandas operators are “observed” by the tracker. The tracker object should be constantly in sync with the state of the underlying dataframe.
  • 43. 45 Joins: 1. Add a second DF to the tracker. 2. Specify join keys. 3. Perform join. All join variants are supported, but joins on indexes are not.
  • 44. 46 Join provenance pattern -- keys [Figure: Join activity with Left, Right and Output entities; relations shown: used (Left, Right), wasInvalidatedBy (Left, Right), wasGeneratedBy (Output), wasDerivedFrom]
  • 45. 47 Join provenance pattern -- non-key elements [Figure: Join activity with Left, Right and Output entities; relations shown: used, wasInvalidatedBy, wasGeneratedBy, wasDerivedFrom]
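As a rough illustration of row-level join provenance, the sketch below carries the original row positions of both inputs through a pandas `merge` and emits one `wasDerivedFrom` statement per contributing input row. The `left_row` / `right_row` helper columns and the string identifiers are assumptions for the example, and only the derivation relation from the patterns above is shown.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K1", "K2"], "A": [1, 2]})
right = pd.DataFrame({"key": ["K1", "K3"], "B": [10, 30]})

# Keep each input's row position as an explicit (hypothetical) id column.
left_ids = left.reset_index().rename(columns={"index": "left_row"})
right_ids = right.reset_index().rename(columns={"index": "right_row"})

# Join on the key column, not on the index.
joined = left_ids.merge(right_ids, on="key", how="outer")

derivations = []
for out_row, row in joined.iterrows():
    for side in ("left", "right"):
        source_row = row[f"{side}_row"]
        if pd.notna(source_row):  # unmatched side of an outer join has no source row
            derivations.append(
                (f"output[{out_row}]", "wasDerivedFrom", f"{side}[{int(source_row)}]")
            )

for statement in derivations:
    print(statement)
```

With an outer join, matched output rows derive from one row on each side, while unmatched rows derive from a single input row only.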
  • 46. 48 Putting it all together
  • 47. 49 Performance Capture: Multiprocessing - writing operator provenance to disk - scanning the dataframe Storage: Compression Benchmark Queries 1 process / core
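The slide names compression as the storage strategy without saying which scheme is used. As a minimal sketch, assuming provenance records are serialised as JSON lines and compressed with gzip (both assumptions), the effect on highly repetitive provenance data can be checked like this:

```python
import gzip
import json

# Synthetic, repetitive provenance records (made-up content for illustration).
records = [
    {"entity": f"cell_{i}", "wasGeneratedBy": "one_hot_1"} for i in range(10_000)
]

raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
packed = gzip.compress(raw)

# Round trip: decompress and parse back into records.
restored = [
    json.loads(line) for line in gzip.decompress(packed).decode("utf-8").splitlines()
]

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

Fine-grained provenance repeats the same activity and relation names for every data element, so general-purpose compressors tend to do well on it.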
  • 48. 50 Multiprocessing – disk writing One-hot encoding with dataframe sizes: 1. 260K 2. 521K 3. 1.3M About 70% improvement
  • 49. 51 Multiprocessing – dataframe scanning Improvement depends on type of operator About 60% improvement
  • 52. 54 Query performance. Results on Census provenance. Query classes: - All Transformations: 0.001s - Feature Operation: 0.001s - Record Operation: 2.2s - Item Operation: 0.47s - Feature Invalidation: 0.004s - Record Invalidation: 0.26s - Item Invalidation: 0.028s
  • 53. 56 Summary 1. Is it practical to collect fine-grained provenance? (a) To what extent can it be done automatically? (b) How much does it cost? 2. Is it also useful? → Does it help address the key questions on high-risk AI systems?