(Explainable) Data-Centric AI: what are you explaining, and to whom?

Prof. Paolo Missier
School of Computer Science
University of Birmingham, UK
May, 2024
A talk given to the
AIM Research Support Facility @ the Turing Institute

2
Data-centric AI
End-to-end processing from data sources to model outputs
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine
Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
Credit: Andrew Ng
Landing.ai
AIM
RSF
May
2024

3
Outline
- Model training and data interventions are becoming entangled
  This is good! But:
- Model-based explanations and data-based explanations should merge, too
  - Data explanations → data provenance?
  - Contextual explanations: the case of healthcare
- Some ideas on how this can be achieved

4
DCAI involves extended feedback loops
[5] Singh, Prerna. ‘Systematic Review of Data-Centric Approaches in Artificial Intelligence and Machine Learning’. Data Science and Management 6,
no. 3 (1 September 2023): 144–57. https://doi.org/10.1016/j.dsm.2023.06.001.

5
Rapidly emerging literature
[2] Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI. Commun. ACM 66, 8 (August 2023), 84–92.
https://doi.org/10.1145/3571724
[3] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, and Xia Hu. ‘Data-Centric AI: Perspectives and Challenges’. arXiv, 2 April 2023.
http://arxiv.org/abs/2301.04819.
[4] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. ‘Data-Centric Artificial Intelligence: A
Survey’. arXiv, 11 June 2023. https://doi.org/10.48550/arXiv.2303.10158.
Source: [3]
Source: [2]

6
Example: incremental label correction
Aim: to develop data performance benchmarks for ML
Complementing MLPerf benchmarks
Both part of MLCommons
https://www.dataperf.org/
[10] Mazumder, Mark, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, et al. ‘DataPerf: Benchmarks for
Data-Centric AI Development’. arXiv, 13 October 2023. https://doi.org/10.48550/arXiv.2207.10062.
Benchmarks emerge through challenges:
demonstrate how model performance can be enhanced through data interventions

7
Correcting mislabelling: challenge
Context: vision dataset for image classification tasks
Given: a training set Dtr of labelled images and a classification task T
- Training data from the Open Images V7 dataset
- Images annotated with image-level labels, object bounding boxes, object segmentation masks…
Scenario: realistically, some of the labels in Dtr are noisy
Target: P = Perf(M(Dtr)), the best performance, obtained when model M is trained on a perfectly labelled Dtr and evaluated on an independent test set Dtest
Challenge: suggest a strategy that minimises the number of label fixes (“cleaning” actions) needed to achieve a target performance gain relative to P
https://www.dataperf.org/training-set-acquisition/acquisition-overview

8
Data Cleaning simulation pattern
[Figure: data cleaning simulation pattern. Evaluator side: Dtr, corrupt labels, Dn, fixed training code, model training (M’), model eval, eval score, clean model training. Competitor side: cleaning priority strategy, D’.]
- A noisy version Dn is generated from Dtr (e.g. by label flipping)
- Target performance is recorded by training on Dtr and testing on Dtest
- Strategies are scored on the number of cleaning actions required to achieve 95% of target performance
- Labelling strategy: a ranking of the examples in Dn to be cleaned to achieve performance close to P
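The scoring loop described on this slide can be sketched in a few lines. This is a toy illustration only, not the DataPerf harness: the two-blob dataset stands in for image features, the "fixed training code" is assumed to be a 1-nearest-neighbour classifier, and `neighbour_disagreement_ranking` is a hypothetical cleaning priority strategy invented for the example.

```python
import numpy as np

def make_blobs(n, rng):
    """Toy stand-in for image features: two Gaussian classes in 2-D."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + 3.0 * y[:, None]
    return X, y

def perf(X_train, y_train, X_test, y_test):
    """Fixed training + eval code (assumed): accuracy of a 1-nearest-neighbour model."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return float((y_train[d.argmin(axis=1)] == y_test).mean())

def neighbour_disagreement_ranking(X, y_noisy, k=5):
    """Hypothetical strategy: clean first the items whose label disagrees
    with most of their k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # ignore self-matches
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest items per row
    disagreement = (y_noisy[nn] != y_noisy[:, None]).mean(axis=1)
    return np.argsort(-disagreement, kind="stable")

def fixes_needed(X, y_clean, y_noisy, ranking, X_test, y_test, frac=0.95):
    """Number of cleaning actions, taken in ranking order, to reach frac * P."""
    target = perf(X, y_clean, X_test, y_test)      # P = Perf(M(Dtr))
    y = y_noisy.copy()
    for n_fixed in range(len(ranking) + 1):
        if perf(X, y, X_test, y_test) >= frac * target:
            return n_fixed
        if n_fixed < len(ranking):
            idx = ranking[n_fixed]
            y[idx] = y_clean[idx]                  # one cleaning action
    return len(ranking)

rng = np.random.default_rng(0)
X, y_clean = make_blobs(200, rng)                  # Dtr
X_test, y_test = make_blobs(200, rng)              # Dtest
flip = rng.random(200) < 0.3                       # evaluator side: corrupt labels
y_noisy = np.where(flip, 1 - y_clean, y_clean)     # Dn

n_strategy = fixes_needed(X, y_clean, y_noisy,
                          neighbour_disagreement_ranking(X, y_noisy), X_test, y_test)
n_random = fixes_needed(X, y_clean, y_noisy, rng.permutation(200), X_test, y_test)
print("fixes needed, strategy:", n_strategy, "random order:", n_random)
```

A lower count means a better ranking; comparing against a random order gives a simple baseline for any candidate strategy.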

9
Data-X: explaining the data side of ML/AI
[Figure: Source datasets → Data processing → Training datasets, the span covered by Data-X]
Data explanation questions:
- How was a dataset processed, step by step?
- Whole dataset → individual items: which ones?
- Why was a specific data item transformed?
Data-Centric AI: complex data transformations and filtering
- Model-driven data cleaning
- Model-driven training set optimization
- …
[Figure, continued: Training → Model outputs, the span covered by Model-X]

10
In our running example…
[Figure: running example pipeline with Dn, CSi, Di, model training, M’, eval and Mbest, under MLOps]
CSi = version i of some Cleaning Strategy
We would like to:
1. Document that Di was derived from Dn using CSi, as part of a longer pipeline
2. Be able to identify what effect CSi had on Dn:
   - which data labels were cleaned
   - why they were cleaned
3. Make sure CSi can be reused safely:
   - specify assumptions and prerequisites
   - provide examples of past usages
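One way to picture these three requirements together is a small descriptor that carries a strategy's assumptions, logs what it changed and why, and remembers where it has been used. The sketch below is an assumption-laden illustration: the `CleaningStrategy` class, its fields and the typo-table strategy are all hypothetical, and a dataset is reduced to a mapping from item id to label.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Dataset = Dict[str, str]  # item id -> label (a deliberately minimal stand-in)

@dataclass
class CleaningStrategy:
    name: str
    version: int
    # fix(item_id, label) returns (new_label, reason), or None to leave the item alone
    fix: Callable[[str, str], Optional[Tuple[str, str]]]
    # human-readable assumption -> check that must hold on the input dataset
    assumptions: Dict[str, Callable[[Dataset], bool]] = field(default_factory=dict)
    past_usages: List[str] = field(default_factory=list)

    def apply(self, dn: Dataset, dataset_name: str):
        """Check prerequisites, derive Di from Dn, and log each change with its reason."""
        violated = [text for text, holds in self.assumptions.items() if not holds(dn)]
        if violated:
            raise ValueError(f"{self.name} v{self.version} not applicable: {violated}")
        di, changes = dict(dn), []
        for item_id, label in dn.items():
            outcome = self.fix(item_id, label)
            if outcome is not None:
                new_label, reason = outcome
                di[item_id] = new_label
                changes.append({"item": item_id, "old": label,
                                "new": new_label, "why": reason})
        self.past_usages.append(dataset_name)      # example of past usage
        return di, changes

TYPOS = {"catt": "cat", "dgo": "dog"}              # illustrative correction table

cs1 = CleaningStrategy(
    name="typo-fix",
    version=1,
    fix=lambda _id, label: (TYPOS[label], "label matched a known typo")
    if label in TYPOS else None,
    assumptions={"labels are non-empty strings":
                 lambda d: all(isinstance(v, str) and v for v in d.values())},
)

dn = {"img1": "cat", "img2": "catt", "img3": "dog"}
di, changes = cs1.apply(dn, "Dn")
print(changes)
```

Returning the change log alongside Di keeps the "which labels" and "why" answers attached to the same call that produced the dataset.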

11
Mission: make new data-centric algorithms explainable, reusable
[Figure: problem instances (Data, Training, Ops); Observe / record → Prov-DB → Reproduce / explain; Curated data toolkit → Enable reuse]
Goals: to support
• Reusability and emerging best practices for complex data intervention + usage patterns
• Reproducibility, explainability of pipeline instances
How:
- Enable observation and capture of data processing
- Build a curated catalogue of interventions + usage patterns
Challenges:
- Observability: instrumenting a common runtime for transparent capture
- Granularity: explanations need to be pitched at the right level for different stakeholders

12
Representing provenance
A formal, interoperable data model and syntax for generic provenance constructs
- extensible to domain vocabularies

13
Provenance layer I: whole dataset
Assumptions:
- Dn, Di atomic units of data
- CS atomic unit of processing
Reproducibility: “Outer layer” questions
- Where does Di come from?
- Which version Di was used to train Mbest?
Derivation:
Di was derived from Dn using CSi
Mbest was trained on Di
Attribution:
CSi was created by <Actor A>
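[Figure: provenance graph. Di wasDerivedFrom Dn; Di wasGeneratedBy CSi; CSi used Dn; CSi wasAssociatedWith A. A generalised pattern over D, Ai-1, Di-1, Ai and D’ uses the same used / wasGeneratedBy (wgby) / wasDerivedFrom relations.]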
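The whole-dataset statements above are small enough to write down directly. The sketch below stores them as plain (relation, subject, object) tuples and prints them in a PROV-N-like textual form; it assumes the relation names on the slide are used with their W3C PROV meaning, and the `train` activity name and the `ex` prefix are placeholders introduced for the example.

```python
# Layer I statements as (relation, subject, object) tuples.
statements = [
    ("used",              "CSi",   "Dn"),
    ("wasGeneratedBy",    "Di",    "CSi"),
    ("wasDerivedFrom",    "Di",    "Dn"),
    ("wasAssociatedWith", "CSi",   "A"),
    ("used",              "train", "Di"),      # "train" is a placeholder activity name
    ("wasGeneratedBy",    "Mbest", "train"),
]

def to_prov_n(statements, prefix="ex"):
    """Render each statement in a PROV-N-like form, e.g. used(ex:CSi, ex:Dn)."""
    return [f"{rel}({prefix}:{s}, {prefix}:{o})" for rel, s, o in statements]

def inputs_of(entity, statements):
    """Entities used by the activity that generated `entity`."""
    activities = {o for rel, s, o in statements
                  if rel == "wasGeneratedBy" and s == entity}
    return sorted(o for rel, s, o in statements
                  if rel == "used" and s in activities)

print("\n".join(to_prov_n(statements)))
print("Mbest was trained on:", inputs_of("Mbest", statements))
print("Di comes from:", inputs_of("Di", statements))
```

The two `inputs_of` calls correspond to the two "outer layer" questions on the slide.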

14
Provenance layer II: data-granular provenance
Assumptions:
- Dn = {x^n_j}, Di = {x^i_j}
- CS atomic unit of processing
Explainability: data-level questions
- Which x^n_j were cleaned?
- “How dirty was Dn?” In aggregate: how many labels were cleaned to achieve a target performance?
Item-level derivations:
for each x^i_j that has been cleaned by CSi: x^i_j was derived from x^n_j
[Figure: provenance graph. x^i_j wasDerivedFrom x^n_j; x^i_j wasGeneratedBy CS; CS used x^n_j; CS wasAssociatedWith C]
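At this granularity, the item-level derivations can be obtained by comparing Dn and Di item by item. The following is a minimal sketch under simplifying assumptions: datasets are mappings from item index j to label, the edge format is invented for the example, and the aggregate only counts labels that were actually changed, so it is a lower bound on how dirty Dn was.

```python
def item_derivations(dn, di, activity="CSi"):
    """One wasDerivedFrom edge per item whose label differs between Dn and Di."""
    edges = []
    for j, old_label in dn.items():
        new_label = di.get(j, old_label)
        if new_label != old_label:
            edges.append({
                "relation": "wasDerivedFrom",
                "entity": ("Di", j),       # x^i_j
                "source": ("Dn", j),       # x^n_j
                "activity": activity,
                "old": old_label,
                "new": new_label,
            })
    return edges

dn = {1: "cat", 2: "dog", 3: "cat", 4: "bird"}
di = {1: "cat", 2: "cat", 3: "cat", 4: "dog"}

edges = item_derivations(dn, di)
cleaned_ids = [e["source"][1] for e in edges]      # which x^n_j were cleaned
fraction_cleaned = len(edges) / len(dn)            # aggregate view
print(cleaned_ids, fraction_cleaned)
```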
Internal representation:
- Property-value graphs
- Neo4j

15
How can we generate these provenance graphs?
Key idea: Interpreter-level observer
- Requires observer at the boundaries of each processor
- Observer has access to individual data items
Gregori, Luca, Paolo Missier, Matthew Stidolph, Riccardo Torlone, and Alessandro Wood. Design and Development of a Provenance Capture Platform for Data Science. In Procs. 3rd DATAPLAT Workshop, Co-Located with ICDE 2024. Utrecht, NL: IEEE, 2024.
A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12 (2022), 3614–3617. https://doi.org/10.14778/3554821.3554857
Adriane Chapman, Luca Lauro, Paolo Missier, and Riccardo Torlone. 2024. Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Trans. Database Syst. Just Accepted (February 2024). https://doi.org/10.1145/3644385
[Figure: provenance pattern for an operator Op. xij wasDerivedFrom xj; xij wasGeneratedBy Op; Op used xj; Op wasAssociatedWith actor]
A starting point:
Data Provenance for Data Science (DPDS)

16
Capturing provenance: sketch
Typical operator implementation:
- Pandas / Spark Python pipelines over dataframe datasets
- CS can be a method call or a code block:
1 - Method call:
D’ = Op(D)
2 - Code block, taking D to Di between two markers:
“Begin Op”
--
--
--
“End Op”
Layer I (coarse): process-level observer
[Figure: D’ wasGeneratedBy Op; Op used D; D’ wasDerivedFrom D. The same relations continue from D’ through train to the model M, under MLOps.]
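Both operator forms can be observed at the process level with very little machinery. The sketch below is not the DPDS implementation: it assumes datasets are named explicitly by the caller (Python objects carry no dataset name), and the `observed` decorator, the `op_block` context manager and the `PROV` list are hypothetical names for the example.

```python
import functools
from contextlib import contextmanager

PROV = []   # recorded (relation, subject, object) statements

def observed(op):
    """Form 1, method call D' = Op(D): record provenance around the call."""
    @functools.wraps(op)
    def wrapper(d, *, d_name, out_name):
        result = op(d)
        PROV.extend([
            ("used", op.__name__, d_name),
            ("wasGeneratedBy", out_name, op.__name__),
            ("wasDerivedFrom", out_name, d_name),
        ])
        return result
    return wrapper

@contextmanager
def op_block(op_name, input_name, output_name):
    """Form 2, code block: 'Begin Op' on entry, 'End Op' on normal exit."""
    PROV.append(("used", op_name, input_name))
    yield
    # Reached only if the block completed; a failed block records no output.
    PROV.append(("wasGeneratedBy", output_name, op_name))
    PROV.append(("wasDerivedFrom", output_name, input_name))

@observed
def drop_empty(rows):
    return [r for r in rows if r]

d1 = drop_empty([[1], [], [2]], d_name="D", out_name="D'")

with op_block("Op", "D'", "Di"):
    di = [[v * 2 for v in r] for r in d1]

for statement in PROV:
    print(statement)
```

At this layer only dataset names are recorded; nothing is said about individual items.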

17
Pipeline -- Example
[Figure: example pipeline over Da, Db and Dc with intermediate datasets D1 to D6: left join (K1,K2), impute all missing, left join (K1,K2), impute E,F, one-hot encoding (add ‘E4’, ‘Ex’, ‘E1’; remove ‘E’)]
import pandas as pd
df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # left join on (K1, K2)
df = df.fillna('imputed')  # impute all missing values
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # left join on (K1, K2)
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # impute columns E and F
# one-hot encoding of column E
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)  # add the indicator columns
df = df.drop([c], axis=1)  # remove the original column 'E'

18
Minimal code instrumentation
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to application
- some control over Tracker surfaced
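To give a feel for what such an observer sees, here is a hedged sketch of a dataframe tracker. It is not the DPDS Tracker: the class and its `apply` method are hypothetical, the pipeline has to be routed through it explicitly (so it is less transparent than the approach described above), and it assumes each operation preserves a unique row index. Joins would need key-based matching, which this sketch does not attempt.

```python
import pandas as pd

class Tracker:
    """Hypothetical observer: holds the current dataframe and logs what each step changed."""

    def __init__(self, df):
        self.df = df
        self.log = []

    def apply(self, description, op):
        before = self.df
        after = op(before)
        shared_cols = [c for c in after.columns if c in before.columns]
        shared_rows = after.index.intersection(before.index)
        old = before.loc[shared_rows, shared_cols]
        new = after.loc[shared_rows, shared_cols]
        # A cell changed if the values differ, ignoring cells missing on both sides.
        changed = (old != new) & ~(old.isna() & new.isna())
        self.log.append({
            "op": description,
            "added_columns": [c for c in after.columns if c not in before.columns],
            "removed_columns": [c for c in before.columns if c not in after.columns],
            "changed_cells": [(i, c) for c in shared_cols for i in shared_rows
                              if changed.at[i, c]],
        })
        self.df = after
        return self

df = pd.DataFrame({"key1": [1, 2], "E": ["E1", None]})
t = Tracker(df)
t.apply("impute E", lambda d: d.fillna({"E": "Ex"}))
t.apply("one-hot E", lambda d: pd.concat(
    [d.drop(columns=["E"]), pd.get_dummies(d["E"])], axis=1))

for entry in t.log:
    print(entry)
```

Each log entry is the raw material for element-level provenance: changed cells map to item derivations, while added and removed columns describe schema-level effects.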

19
Provenance traversals – example
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: provenance traversal through fillna and Join over df_1, df_B (df_0) and df_A (df_-1)]
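A backward traversal over element-level derivations can be sketched as a breadth-first walk. The graph below is made up for illustration: elements are identified by hypothetical (dataframe, row, column) triples, and an in-memory dictionary stands in for the provenance store.

```python
from collections import deque

# wasDerivedFrom edges: derived element -> elements it was derived from.
derived_from = {
    ("imputed", 0, "E"): [("joined", 0, "E")],     # fillna step
    ("joined", 0, "E"): [("df_B", 0, "E")],        # join step, right-hand input
    ("joined", 0, "A"): [("df_A", 0, "A")],        # join step, left-hand input
}

def lineage(element, derived_from):
    """All elements reachable from `element` via wasDerivedFrom, breadth-first."""
    seen, queue, order = {element}, deque([element]), []
    while queue:
        for parent in derived_from.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

ancestors = lineage(("imputed", 0, "E"), derived_from)
sources = [e for e in ancestors if e not in derived_from]   # elements with no parents
print(ancestors)
print(sources)
```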

20
Explain what? end-to-end
[Figure: end-to-end pipeline. Source datasets → Data processing → Training datasets (Data-X) → Training → Model outputs (Model-X)]
Incorrect and correct predictions,
and examples selected to explain them
Lin, Jinkun, Anqi Zhang, Mathias Lécuyer, Jinyang Li, Aurojit Panda, and Siddhartha Sen. ‘Measuring the Effect of Training Data on Deep Learning
Predictions via Randomized Experiments’. In Proceedings of the 39th International Conference on Machine Learning, 13468–504. PMLR, 2022.
https://proceedings.mlr.press/v162/lin22h.html.
Provenance:
Where do these examples come from?
e.g. explain the data augmentation strategy

21
Explain to whom?
Explanations should be contextualised / customized / adapted for different "stakeholders"
Example: healthcare
Two broad categories of stakeholders:
1. "AI clients"
- Health care professionals: GPs, specialists, nurses, administrators, policy makers, regulators
- Patients and public
- involved in the co-design of AI-based solutions
- not primarily involved in AI development and validation
2. Data and AI specialists: data controllers, data scientists, AI experts
Responsible AI:
- The two categories should work together (co-design)
- Each stakeholder will have a different role as part of the "DCAI loop"

22
Model-X and Data-X differ across stakeholders
AI clients may hold specialists accountable for the trustworthiness of the final product
- Health care professionals:
  - Expect quantified confidence in the model output: "epistemic humility"
- Regulators, policy makers:
  - Expect evidence of model fairness
- Patients and public:
  - May accept trust by transitivity (e.g. I trust my doctor, who trusts the system)

23
Mapping the XDCAI space
What are you explaining?
- Model outputs
- Training
- Training datasets
- Data processing
- Source datasets
To whom?
- Data controllers, data scientists, AI developers
- Clinical professionals (doctors, nurses, …)
- Patients & public
- Regulators
- Health admin
Explanations placed in the mapping:
- LIME, SHAP, occlusion testing
- Influence functions, subgroup testing…
- How has the training set been produced?
- Why has data point X in the input been affected?
- Contextualised end-to-end explanations

24
The bigger picture
[Figure: the end-to-end pipeline (Source datasets, Data processing, Training datasets, Training, M, Inference/generation, Model outputs), annotated with: Observe & record (DPDS as a starting point); Model explanations; contextualised explanations; socio-technical co-design between clients and specialists]

25
Summary
[Figure: XDCAI combines Model-X (model training) with Data-X (data interventions)]
Goals:
Make complex data interventions safely reusable and explainable
- Demonstrate Data-X using layered provenance
- Combine Model-X and Data-X
- Support contextualised explanations
