Prof. Paolo Missier
School of Computing
Newcastle University, UK
May, 2021
Data Provenance for Data Science
In collaboration with:
Prof. R. Torlone, Giulia Simonelli, Luca Lauro – Università Roma Tre, Italy
Prof. A. Chapman – University of Southampton, UK
Data → Model → Predictions

[Figure: the data-science pipeline. Data collection yields raw datasets; pre-processing turns them into features and instances; a model is learned from these and issues predictions about you: a ranking, a score, or a class.]
Key decisions are made during data selection and processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
A concrete example
The classic “Titanic” dataset: Can you predict survival probabilities?
• Approach: simple logistic regression analysis (a minimal sketch follows the feature list)
Features:
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare (British pound)
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Outcome:
Survived (0 = No; 1 = Yes)
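As a minimal sketch of this approach, assuming the standard Kaggle train.csv and scikit-learn (the file name, encoding, and feature choices below are illustrative, not the deck's exact script):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Illustrative sketch only: feature choices are assumptions.
df = pd.read_csv('train.csv')
df['Sex'] = (df['Sex'] == 'female').astype(int)   # encode Sex as 0/1
df['Age'] = df['Age'].fillna(df['Age'].mean())    # naive imputation, revisited below
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df['Survived']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f1_score(y_te, model.predict(X_te)))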
Enable analysis of data pre-processing

The data preparation workflow includes a number of decisions:
- Is the target class balanced? (down- / up-sample)
- Dropping irrelevant attributes: 'PassengerId', 'Name', 'Ticket', 'Cabin'
- Managing missing values: Age is missing in 177 of the 891 records; on the premise that “Pclass is a good predictor for age”, impute Age values using the average age per Pclass (see the sketch below)
- Dropping correlated features (?): drop “Fare” and “Pclass”
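A minimal pandas sketch of two of these steps, assuming the dataframe from the earlier sketch:

# Is the target class balanced?
print(df['Survived'].value_counts(normalize=True))

# Impute missing Age values with the average age of each passenger class,
# on the premise that Pclass is a good predictor for Age.
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))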
Example: missing values imputation
Also: script alludes to human decisions
<event
name>
How do we capture these decisions?
To what extent can they be inferred from code?
Correlation analysis
• Is Pclass really a good predictor for Age? (checked in the sketch below)
• Why drop both Pclass and Fare?

Alternative pre-processing:
1. Drop Age only: nearly identical performance (F1 = 0.77 vs. 0.76)
2. Use Sex and Pclass only
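One way to check the questioned correlations (a sketch, assuming the numerically encoded dataframe from earlier):

# Pairwise correlations: is Pclass really predictive of Age,
# and how strongly are Pclass and Fare correlated with each other?
print(df[['Age', 'Pclass', 'Fare']].corr())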
Also: exploring the effect of alternative pre-processing

[Figure: the same raw dataset D is prepared by two alternative pipelines. P1 produces D1, from which model M1 is learned; P2 produces D2, from which model M2 is learned. On the same input x, M1 predicts y1 and M2 predicts y2, with y1 ≠ y2.]

How can knowledge of P1, P2 help understand why y1 ≠ y2?
Ex.: alternative imputation methods for missing values
Ex.: boost the minority class / downsample the majority class
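A toy sketch of this comparison (the strategies and names are illustrative): run the same learner after two alternative imputation pipelines P1 and P2, then diff the predictions.

import pandas as pd
from sklearn.linear_model import LogisticRegression

raw = pd.read_csv('train.csv')
raw['Sex'] = (raw['Sex'] == 'female').astype(int)
cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']

d1, d2 = raw.copy(), raw.copy()
d1['Age'] = d1['Age'].fillna(d1['Age'].mean())                               # P1
d2['Age'] = d2['Age'].fillna(d2.groupby('Pclass')['Age'].transform('mean'))  # P2

m1 = LogisticRegression(max_iter=1000).fit(d1[cols], d1['Survived'])
m2 = LogisticRegression(max_iter=1000).fit(d2[cols], d2['Survived'])

disagree = m1.predict(d1[cols]) != m2.predict(d2[cols])
print(disagree.sum(), 'passengers where y1 != y2')

Without a record of which cells P1 and P2 each changed, the disagreeing cases cannot be explained; that record is exactly what fine-grained provenance provides.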
Some concrete questions
Appropriateness of the training set, bias: is the training data fit to learn from?
Appropriateness of pre-processing: were best practices followed?
Debugging / explaining: output value Y looks wrong; can you tell me how it was produced?
Auditing:
• Who was responsible for generating output Y?
• Has any privacy agreement been violated in producing Y?
Access control: access to Y may be restricted based on the derivation history of Y
Traceability, explainability, transparency – EU regulations
“Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing!
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events
(‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or
common specifications.
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
“AI systems that create a high risk to the health and safety or fundamental rights of natural persons […] the classification as high-risk does not only depend on the function performed by the AI system, but also on the specific purpose and modalities for which that system is used.”
- used for the purpose of assessing students
- recruitment or selection of natural persons
- evaluate the eligibility of natural persons for public assistance benefits and services
- evaluate the creditworthiness of natural persons or establish their credit score
- used by law enforcement authorities for making individual risk assessments
Provenance
A possible approach to help answer some of the questions:
1. Automatically generate metadata that describes the flow of data through the pipeline as it occurs
2. Persistently store the metadata for each run of the pipeline
3. Map the questions to queries on the metadata store (a toy sketch follows below)
Data provenance is a structured form of metadata that may fit the purpose
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (‘logs’) while the
high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or common specifications.
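To make step 3 concrete, a toy sketch (the data structures here are illustrative, not the actual metadata store): once derivations are recorded for each run, an audit question like “how was Y produced?” becomes a graph traversal.

# Toy metadata store: wasDerivedFrom edges recorded during one pipeline run.
derived_from = {
    'Y': ['features_v2'],
    'features_v2': ['raw_data_v1', 'imputation_parameters'],
    'raw_data_v1': ['train.csv'],
}

def lineage(item):
    # Transitive closure over wasDerivedFrom edges.
    seen, stack = set(), [item]
    while stack:
        for parent in derived_from.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(lineage('Y'))  # answers: how was output Y produced?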
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners: chain of custody
Magna Carta (‘the Great Charter’) was agreed
between King John and his barons on 15 June 1215.
The W3C PROV model (2013)
[Figure: the core PROV pattern. A processing activity uses its inputs (Input 1 … Input n) and generates its outputs (Output 1 … Output m); each output is additionally linked to the inputs by derivation relations.]
The W3C PROV model (2013)
https://www.w3.org/TR/prov-dm/
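The pattern above can be written down with, for example, the Python prov package (a minimal sketch; the identifiers are illustrative):

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

inp = doc.entity('ex:Input1')
out = doc.entity('ex:Output1')
act = doc.activity('ex:processing')
doc.used(act, inp)              # usage
doc.wasGeneratedBy(out, act)    # generation
doc.wasDerivedFrom(out, inp)    # derivation

print(doc.get_provn())          # PROV-N serialisation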
Pipeline structure with provenance annotations

[Figure: an end-to-end ML pipeline. Data sources feed Acquisition and wrangling; Preparing for learning and the Training / test split yield a Training set and a Test set; Model Selection, Model Learning, Model Testing, and Model Validation produce a model M; Model Usage turns M into Predictions.

Decision points during acquisition and wrangling:
- Source selection
- Sample / population shape
- Cleaning
- Integration

Decision points while preparing for learning:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …

Below the pipeline, a provenance trace records the concrete path taken: Imputation produces D’, Feature selection produces D’’, the Training / test split yields the Training set, and Model Learning (with its hyper-parameters) yields M; the choices made along the way (C1, C2, C3) are attached as annotations.]
Can provenance help address the new EU regulations?
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
Article 12 Record-keeping
2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that
is appropriate to the intended purpose of the system.
3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect
to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or
lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61.
4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a
minimum:
(a) recording of the period of each use of the system (start date and time and end date and time of each use);
(b) the reference database against which input data has been checked by the system;
(c) the input data for which the search has led to a match;
(d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
Provenance of what? Programs and datasets can be observed at different levels of detail:
- Base case: an opaque program Po over a coarse-grained dataset. Default provenance: every output depends on every input.
- A transparent program PT over coarse-grained datasets.
- A transparent program PT over fine-grained datasets.
- A transparent pipeline over fine-grained datasets.
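In the base case the only safe statement is a cross-product of derivations (a toy sketch; the file names are made up):

# Opaque program over coarse-grained data: the default provenance
# says every output was derived from every input.
inputs = ['train.csv', 'params.yaml']
outputs = ['model.pkl', 'predictions.csv']
default_prov = [(o, 'wasDerivedFrom', i) for o in outputs for i in inputs]
print(default_prov)

Transparent programs and fine-grained datasets allow the derivations to be stated per operator and per cell instead.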
Data Provenance for Data Science: technical insight
Technical approach [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Demonstration of provenance queries
- Performance analysis
- Collecting provenance incurs space and time overhead
- Performance of provenance queries
[1] Chapman, A., Missier, P., Simonelli, G., & Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4):507-520, January 2021.
Pre-processing operators

[1] Berti-Equille, L. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In: The World Wide Web Conference (WWW ’19). New York, NY, USA: ACM Press; 2019. p. 2580-2586.
[2] García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. Big data preprocessing: methods and prospects. Big Data Analytics. 2016 Dec 1;1(1):9.
Typical operators used in data prep
Operators

Data reduction:
- Feature selection
- Instance selection

Data augmentation:
- Space transformation
- Instance generation
- Encoding (e.g. one-hot…)

Data transformation:
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation

Ex.: vertical augmentation → adding columns
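For instance, one-hot encoding is a vertical augmentation because it adds columns (a pandas sketch):

import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
# Vertical augmentation: one-hot encoding replaces one column with three.
df = pd.get_dummies(df, columns=['Embarked'])
print(df.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']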
Making your code provenance-aware
import pandas as pd

df = pd.DataFrame(…)
# Create a new provenance document
p = pr.Provenance(df, savepath)
# Create a provenance tracker that wraps the dataframe
tracker = ProvenanceTracker.ProvenanceTracker(df, p)
# Instance generation: append a new row
tracker.df = tracker.df.append({'key2': 'K4'}, ignore_index=True)
# Imputation: fill missing values
tracker.df = tracker.df.fillna('imputato')
# Feature transformation of column D
tracker.df['D'] = tracker.df['D'] * 2
# Feature transformation of column key2
tracker.df['key2'] = tracker.df['key2'] * 2
Idea: a Python tracker object intercepts dataframe operations. Operations that are channelled through the tracker generate provenance fragments.
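A stripped-down illustration of the interception idea (this is not the actual library code; the real tracker also records the fragments in a PROV document):

import pandas as pd

class Tracker:
    """Toy stand-in for ProvenanceTracker: diffs the dataframe on every
    assignment to .df and reports what kind of change occurred."""
    def __init__(self, df):
        self._df = df

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        added = set(new_df.columns) - set(self._df.columns)
        if added:
            print('fragment: columns generated', sorted(added))
        elif len(new_df) != len(self._df):
            print('fragment: instances generated or removed')
        elif not new_df.equals(self._df):
            print('fragment: values transformed in place')
        self._df = new_df

tracker = Tracker(pd.DataFrame({'A': [1.0, None]}))
tracker.df = tracker.df.fillna(0)                      # imputation
tracker.df = tracker.df.assign(B=tracker.df['A'] * 2)  # vertical augmentation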
Provenance patterns
Provenance templates
Template + binding rules = instantiated provenance fragment

An operator op maps {old values: F, I, V} → {new values: F’, J, V’}.
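A toy illustration of the instantiation step (the actual patterns are formalised in [1]; the names here are invented): a generic transformation template is bound to the concrete cells an imputation touched.

def instantiate(op_name, changed_cells):
    # Bind the generic template to the cells the operator actually changed.
    fragment = []
    for row, col in changed_cells:
        new_cell, old_cell = f'{col}@{row}#v2', f'{col}@{row}#v1'
        fragment.append((new_cell, 'wasDerivedFrom', old_cell))
        fragment.append((new_cell, 'wasGeneratedBy', op_name))
    return fragment

print(instantiate('imputation', [(5, 'Age'), (17, 'Age')]))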
This applies to all operators…

Putting it all together

Evaluation - performance

Evaluation: Provenance capture and query times

Scalability
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? That is, does it help address the key questions on high-risk AI systems?
Questions?

Editor's Notes

  • #3 How about the data used to train / build the model?
  • #15 baseline-noAgents.provn
  • #27 Features $X = [\mathbf{a}_1 \ldots \mathbf{a}_k]$; new features $Y = [\mathbf{a}'_1 \ldots \mathbf{a}'_l]$. New values for each row are obtained by applying $f$ to the values in the $X$ features.