Data-centric AI and the convergence of data and model engineering:
opportunities to streamline the end-to-end data value chain
Prof. Paolo Missier
School of Computing
Newcastle University, UK
University of Évora, Portugal
Nov 22-24, 2023
Health Data Science is all about the Data
Tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. Kremer, R.; Raza, S. M.; Eto, F.; Casement, J.; Atallah, C.; Finer, S.; Lendrem, D.; Barnes, M.; Reynolds, N. J.; and Missier, P. In 2022 IEEE International Conference on Big Data (Big Data), pages 4390–4399, December 2022.
IDEAL
2023
Data-centric AI (DCAI) – is that a thing?
towardsdatascience.com
https://datacentricai.org/
https://landing.ai/data-centric-ai/
Background: The “Landing AI” DCAI pitch
A practical view
https://www.vanderschaar-lab.com/data-centric-ai/
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable
Machine Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
Talk outline
• Data Centric AI sounds familiar to database people…
 → old wine in new bottles?
• What is new:
 → Iteration patterns that span the entire data / training / deployment / monitoring lifecycle
• A few examples: Training set cleaning, pruning
• Goal: harvesting this knowledge into reusable patterns and libraries
 → To support end-to-end data science pipelines in a principled way
• How can we do this?
 → Data versioning, provenance in support of reproducibility and explainability
 → Current technical solutions, and open challenges
Rapidly emerging literature
[2] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, and Xia Hu. ‘Data-Centric AI: Perspectives and Challenges’. arXiv, 2 April 2023.
http://arxiv.org/abs/2301.04819.
[3] Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI. Commun. ACM 66, 8 (August 2023), 84–92.
https://doi.org/10.1145/3571724
[4] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. ‘Data-Centric Artificial Intelligence: A
Survey’. arXiv, 11 June 2023. https://doi.org/10.48550/arXiv.2303.10158.
Source: [2]
Source: [3]
Hypothesis: DCAI involves extended feedback loops
Source: [5]
[5] Singh, Prerna. ‘Systematic Review of Data-Centric Approaches in Artificial Intelligence and Machine Learning’. Data Science and Management 6,
no. 3 (1 September 2023): 144–57. https://doi.org/10.1016/j.dsm.2023.06.001.
The interesting twist: end-to-end patterns
[Figure: end-to-end pipeline with labels: Data interventions, Data Selection, Learning-ready data, Competitive Modelling, Evaluation (model performance), Ops / Monitoring]
Why are these interesting?
- human-in-the-loop pipelines
- challenging to explain, reproduce
- a combination of skills required
A closer look: data cleaning for ML
[6] Neutatz, Felix, et al. "From Cleaning before ML to Cleaning for ML." IEEE Data Eng. Bull. 44.1 (2021): 24-41.
Hypothesis: data cleaning needs to take an end-to-end, application-driven approach that integrates cleaning throughout the ML application
These data interventions by themselves are not entirely new to Data Quality practitioners
However: DQ can be model- and application-specific. [6]
Example: ActiveClean
[7] Krishnan, Sanjay, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. ‘ActiveClean: Interactive Data Cleaning for Statistical
Modeling’. Proceedings of the VLDB Endowment 9, no. 12 (1 August 2016): 948–59. https://doi.org/10.14778/2994509.2994514.
Initial model trained on a dirty training set
→ The consequence of dirty data is that the wrong loss function is optimized
→ The best model is a suboptimal point along the “real” loss function
Approach: incremental cleaning combined with SGD:
Repeat until <cleaning budget reached>:
→ ActiveClean selects “dirty” items that are expected to move the model down the gradient
→ items are cleaned (e.g. manually) and added to the clean set
- relies on convexity for convergence
“cleaning” → (manually) removing outliers and attribute transformation
Problem: Full cleaning not feasible
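The loop on this slide can be illustrated with a deliberately simplified sketch. It is not the estimator from [7], which samples dirty records and reweights their gradients; here the items with the largest current gradient are simply cleaned first, and the model is updated by gradient steps on the clean set only. The one-parameter model y = w * x, the `oracle` function standing in for manual cleaning, and all parameter values are assumptions made for illustration.

```python
def fit_dirty(xs, ys):
    """Closed-form least squares for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def incremental_clean(xs, dirty_ys, oracle, w, budget, batch=2, lr=0.5, steps=20):
    """Clean up to `budget` items in batches; update w by gradient steps on the clean set."""
    clean = {}  # index -> cleaned label
    while len(clean) < budget:
        todo = [i for i in range(len(xs)) if i not in clean]
        # Prioritise the items whose squared-loss gradient is currently largest.
        todo.sort(key=lambda i: abs(2 * (w * xs[i] - dirty_ys[i]) * xs[i]), reverse=True)
        for i in todo[:min(batch, budget - len(clean))]:
            clean[i] = oracle(i)  # hypothetical stand-in for manual cleaning
        for _ in range(steps):
            grad = sum(2 * (w * xs[i] - y) * xs[i] for i, y in clean.items()) / len(clean)
            w -= lr * grad
    return w, sorted(clean)


# Toy data: true relation y = 2x, with two corrupted labels.
xs = [i / 10 for i in range(1, 11)]
true_ys = [2 * x for x in xs]
dirty_ys = list(true_ys)
for i in (7, 9):
    dirty_ys[i] += 5.0

w_dirty = fit_dirty(xs, dirty_ys)  # initial model trained on dirty data
w_clean, cleaned = incremental_clean(xs, dirty_ys, lambda i: true_ys[i], w_dirty, budget=4)
```

In this toy setting the two corrupted items have the largest gradients under the dirty model, so they are cleaned in the first batch and the parameter moves towards the true value well before the budget is spent.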
Some concrete goals
1. Reusability and reproducibility
• Build a curated, reusable library of safe and predictable data+model intervention patterns
2. Versioning: End-to-end pipeline management → beyond model management (e.g. MLflow)
3. Explainability: why was an intervention needed? How and why was a datapoint affected?
• How can this be achieved?
• What language is needed to express explanations?
• At which level of abstraction?
To provide infrastructure support to streamline interesting end-to-end patterns
Use cases from the DataPerf challenges
Aim: to develop data performance benchmarks for ML
Complementing the MLPerf benchmarks
Both part of MLCommons
Why are these interesting?
- surface and demonstrate novel data intervention approaches
- evidence new data/training interleaving patterns
- offer opportunities for reusability / generalisability
- challenges typically fix dataset + model, focus on optimal interventions
https://www.dataperf.org/
[10] Mazumder, Mark, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, et al. ‘DataPerf: Benchmarks for
Data-Centric AI Development’. arXiv, 13 October 2023. https://doi.org/10.48550/arXiv.2207.10062.
Benchmarks emerge through challenges:
demonstrate how model performance can be enhanced through data interventions
Use case 1: “Training Set Debugging” challenge
Context: Vision dataset for image classification tasks
Given: a training set Dtr of labelled images and a classification task T
Images annotated with image-level labels, object bounding boxes, object segmentation masks…
Training data from the Open Images V7 dataset
“Data-Centric” Challenge Scenario and Goals:
- realistically, some of the labels in Dtr are noisy
- Goal: suggest a strategy that minimises the number of label fixes (“cleaning”) needed to achieve a target performance gain relative to P
For reference, we first calculate best model performance P for M(Dtr)
trained on a perfectly labelled Dtr and an independent test set Dtest
Data Cleaning simulation pattern
[Figure: data cleaning simulation, split into a competitor side and an evaluator side. Labels: Dtr, corrupt labels, Dn, cleaning priority strategy, D’, model training, M’, model eval, eval score, fixed training code, clean model training]
A noisy version Dn is generated from Dtr (e.g. label flipping)
Target performance recorded by training on Dtr and testing on Dtest
Strategies are scored based on the number of cleaning actions required to achieve 95% of target performance
Model performance on Dn will be degraded relative to top performance P achieved by M on Dtr
Strategy must suggest a ranking of examples in Dn such that by "cleaning" those in order, performance approximates top performance
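The scoring rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DataPerf evaluation code: the "fixed training code" is replaced by a toy majority-vote classifier, "cleaning" an example means restoring its label from Dtr, and the names `train`, `accuracy` and `cleaning_cost` are hypothetical.

```python
from collections import Counter


def train(dataset):
    """Toy fixed training code: predict the majority label seen for each feature value."""
    votes = {}
    for x, y in dataset:
        votes.setdefault(x, Counter())[y] += 1
    return {x: c.most_common(1)[0][0] for x, c in votes.items()}


def accuracy(model, test_set):
    return sum(model.get(x) == y for x, y in test_set) / len(test_set)


def cleaning_cost(d_tr, d_n, d_test, ranking, fraction=0.95):
    """Number of cleaning actions, taken in ranked order, needed to reach fraction * P."""
    target = fraction * accuracy(train(d_tr), d_test)  # P comes from the clean Dtr
    current = list(d_n)
    for k in range(len(ranking) + 1):
        if accuracy(train(current), d_test) >= target:
            return k
        if k < len(ranking):
            current[ranking[k]] = d_tr[ranking[k]]  # "cleaning" restores the true label
    return None


# Dtr: four feature values, five examples each; Dn flips three labels in two of them.
d_tr = [(x, x % 2) for x in range(4) for _ in range(5)]
flipped = {0, 1, 2, 5, 6, 7}
d_n = [(x, 1 - y) if i in flipped else (x, y) for i, (x, y) in enumerate(d_tr)]
d_test = [(x, x % 2) for x in range(4)]

good_ranking = [0, 5] + [i for i in range(20) if i not in (0, 5)]
bad_ranking = list(range(19, -1, -1))
cost_good = cleaning_cost(d_tr, d_n, d_test, good_ranking)
cost_bad = cleaning_cost(d_tr, d_n, d_test, bad_ranking)
```

A strategy that ranks influential noisy examples first reaches the target with far fewer cleaning actions than one that ranks them last, which is what the challenge score rewards.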
What can be learnt from this exercise?
[Figure: iterative process with labels: Dn, cleaning strategy, D’, model training, M’, eval, Mbest, MLOps]
The challenge is effectively a simulation of a two-level iterative process:
Challenge winners will have developed and demonstrated new strategies for training set debugging
However:
Strategy may be optimized for dataset Dn, task T, and the pre-selected model
Provenance and versioning
[Figure: pipeline with labels: Dn, CSi, Di, model training, M’, eval, Mbest, MLOps]
We would like to:
1. Document that Di was derived from Dn using CSi, as part of a longer pipeline
2. Be able to identify what effect CSi had on Dn:
   a. Which data labels were cleaned
   b. Why they were cleaned
3. Make sure CSi can be reused safely:
   a. Specify assumptions, pre-requisites
   b. Provide examples of past usages
Provenance layers I: coarse provenance
Assumptions:
- Dn, Di atomic units of data
- CS atomic unit of processing
Reproducibility: “Outer layer” questions:
- Where does Di come from?
- Which version of Di was used to train Mbest?
Derivation:
Di was derived from Dn using CSi
Mbest was trained on Di
Attribution:
CSi was created by <creator C>
[Figure: provenance graph over Dn, Di, CSi and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith]
Provenance layers II: data-granular provenance
Assumptions:
- Dn = {x^n_j}, Di = {x^i_j}
- CS atomic unit of processing
Explainability: Data-level Questions:
- which x^n_j were cleaned?
- “how dirty was Dn?”
  in aggregate: how many labels were cleaned to achieve a target performance?
Derivations:
for each x^i_j that has been cleaned by CSi:
  x^i_j was derived from x^n_j
[Figure: provenance graph over x^n_j, x^i_j, CSi and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith]
Provenance layers III: data- and process-granular provenance
Assumptions:
- Dn = {x^n_j}, Di = {x^i_j}
- CS is provenance-aware: able to attach a local explanation to each data derivation
Explainability: CS explains its own actions:
- “Why was x^n_j cleaned?”
Derivations: same as previous
- local explanations are algorithm-specific
- some standardization possible, e.g. human-supplied labels (Active Learning)
[Figure: provenance graph over x^n_j, x^i_j, CSi and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, plus a “why” annotation]
Representing provenance
W3C PROV: a formal, interoperable data model and syntax for generic provenance constructs
- accommodates layers I, II
- extensible to a domain vocabulary → e.g. DC-Check
Example: Layers I, II
[Figure: provenance graph over x^n_j, x^i_j, CS and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, plus a “why” annotation]
entity(D_noisy, [prov:type="training-set"])
entity(D_clean, [prov:type="training-set"])
entity(e1, [prov:type="training-data", inSet="D_noisy", index=j, val=V])
entity(e2, [prov:type="training-data", inSet="D_clean", index=j, val=W])
agent(C, [prov:type="CS-creator"])
activity(CS, [prov:type="cleaning-strategy", version="v1.0", desc="…"])
wasDerivedFrom(e2, e1)
used(CS, e1)
wasGeneratedBy(e2, CS)
wasAssociatedWith(CS, C)
Internal representation:
Property-value graphs!
Hint:
Neo4j works well…
Surface representation:
PROV-N: a relational-like notation (a la Datalog)
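To make the two representations concrete, here is a small sketch that builds Layer I and Layer II statements as (relation, subject, object) triples, prints them in a PROV-N-like surface form, and answers a data-level question by scanning the triples as one would query a graph. The identifier scheme (`e_noisy_j`, `e_clean_j`) and the function names are assumptions for illustration, and the output is not validated against the PROV-N grammar.

```python
def layer2_provenance(noisy, clean, activity="CS", agent="C"):
    """Return PROV-style triples: dataset-level (Layer I) plus one group per changed label (Layer II)."""
    triples = [
        ("wasDerivedFrom", "D_clean", "D_noisy"),
        ("used", activity, "D_noisy"),
        ("wasGeneratedBy", "D_clean", activity),
        ("wasAssociatedWith", activity, agent),
    ]
    for j, (v, w) in enumerate(zip(noisy, clean)):
        if v != w:  # only cleaned items get item-level derivations
            e1, e2 = f"e_noisy_{j}", f"e_clean_{j}"
            triples += [
                ("wasDerivedFrom", e2, e1),
                ("used", activity, e1),
                ("wasGeneratedBy", e2, activity),
            ]
    return triples


def to_statements(triples):
    """Render triples in a PROV-N-like surface form."""
    return [f"{rel}({s}, {o})" for rel, s, o in triples]


def cleaned_items(triples):
    """Data-level question: which items of the noisy set were cleaned?"""
    return [o for rel, _, o in triples if rel == "wasDerivedFrom" and o.startswith("e_noisy_")]


triples = layer2_provenance(noisy=[0, 1, 1, 0], clean=[0, 0, 1, 1])
statements = to_statements(triples)
which = cleaned_items(triples)
```

In a graph database the same triples would become edges between nodes, and `cleaned_items` would become a pattern-matching query.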
Use case 2: training set optimisation
Motivation: training efficiency
→ model performance (test loss) correlates with training data size D according to a power law [11]
However, “Since scalings with N (model size), D (training tokens), Cmin (compute budget) are power-laws, there are diminishing returns with increasing scale.” [11]
[11] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
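As a worked illustration of the diminishing returns, the data-limited form of the scaling law in [11] can be written as a power law in D. The exponent below is the approximate value reported in [11] for language models; it is quoted only to show the order of magnitude and should not be assumed to transfer to other tasks.

```latex
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095

\frac{L(2D)}{L(D)} = 2^{-\alpha_D} \approx 0.936
```

Under this assumption, doubling the training set lowers the test loss by only about 6%, and halving the loss would require growing D by a factor of 2^{1/\alpha_D}, which is on the order of a thousand.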
This motivates trying to optimize D:
1- Redundancy in D leads to wasted training time
2- Not all training examples are equally important for training:
→ which ones should be kept / removed?
Training set optimization Task 1: reducing redundancy
[12] Abbas, Amro, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. ‘SemDeDup: Data-Efficient Learning at Web-Scale through
Semantic Deduplication’. arXiv, 22 March 2023. http://arxiv.org/abs/2303.09540.
Approach [12]:
1. Map the training set D to an embedded space – using pre-trained foundation models
2. Cluster all data points in embedded space using k-means
3. Using cosine similarity, identify similar points within each cluster. Threshold and select
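The three steps can be sketched in a few lines. This is a simplified illustration, not the SemDeDup implementation from [12]: embeddings and k-means cluster assignments are assumed to be given, and within each cluster an item is greedily dropped when its cosine similarity to an already kept item reaches the threshold. The function names and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def semantic_dedup(embeddings, cluster_ids, threshold=0.95):
    """Return indices to keep; near-duplicates are only searched for within a cluster."""
    kept = []
    kept_by_cluster = {}
    for i, (emb, c) in enumerate(zip(embeddings, cluster_ids)):
        representatives = kept_by_cluster.setdefault(c, [])
        # Keep the item only if it is not too similar to anything already kept in its cluster.
        if all(cosine(emb, embeddings[j]) < threshold for j in representatives):
            representatives.append(i)
            kept.append(i)
    return kept


embeddings = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.02, 1.0), (1.0, 1.0)]
cluster_ids = [0, 0, 1, 1, 0]  # assumed output of a prior k-means step
kept = semantic_dedup(embeddings, cluster_ids)
```

Restricting comparisons to items in the same cluster is what keeps the pairwise similarity step tractable on large training sets.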
Training set optimization Task 2: pruning easy/hard examples
Main findings from [13]:
1. Not all training examples are created equal
• Hard vs easy
2. The best pruning strategy depends on the amount of initial data
• Small TS → keep the easy examples
• Large TS → keep the hard examples
[13] Sorscher, Ben, et al. "Beyond neural scaling laws: beating power law scaling via data pruning." Advances in Neural Information Processing
Systems 35 (2022): 19523-19536.
Reproduced from [13]
A really simple pruning method – very similar to Task 1
"To compute a self-supervised pruning metric for ImageNet, we perform k-means
clustering in the embedding space of an ImageNet pre-trained self-supervised model and
define the difficulty of each data point by the Euclidean distance to its nearest cluster
centroid, or prototype"
Caveat: only tested on ImageNet!
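The metric quoted above can be sketched directly: difficulty is the Euclidean distance from an embedded point to its nearest centroid, and pruning keeps either the easiest or the hardest fraction. Centroids are assumed to come from a prior k-means run; the function names and the `keep` parameter are hypothetical.

```python
import math


def difficulty(point, centroids):
    """Distance to the nearest cluster centroid (prototype); larger means harder."""
    return min(math.dist(point, c) for c in centroids)


def prune(points, centroids, keep_fraction, keep="hard"):
    """Return sorted indices of the examples retained after pruning."""
    ranked = sorted(range(len(points)), key=lambda i: difficulty(points[i], centroids))
    n_keep = max(1, round(keep_fraction * len(points)))
    chosen = ranked[-n_keep:] if keep == "hard" else ranked[:n_keep]
    return sorted(chosen)


points = [(0, 0), (0, 1), (5, 5), (5, 8), (0, 2)]
centroids = [(0, 0), (5, 5)]  # assumed output of a prior k-means step
hard = prune(points, centroids, keep_fraction=0.4, keep="hard")
easy = prune(points, centroids, keep_fraction=0.4, keep="easy")
```

Following finding 2, one would select `keep="easy"` for a small initial training set and `keep="hard"` for a large one.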
Provenance for training set optimization
This is a classic filter pipeline – only a little more sophisticated:
[Figure: black-box view over TSfull, Filter and TSopt, with relations used, wasGeneratedBy, wasDerivedFrom]
[Figure: pipeline TSfull → Embed → Cluster → Select → TSopt]
Layers I and II are very similar to Use Case 1:
Reproducibility:
- Where does TSopt come from?
→ black / gray box options
[Figure: gray-box view over TSfull, Embed, TSemb, Cluster, TSclus, Select and TSopt, with relations used, wasGeneratedBy (wgby), wasDerivedFrom]
Provenance for training set optimization / Layer II
Assumptions:
- TSfull = {ti}, TSopt = {ti}
- Filter is an atomic unit of processing
Explainability: Data-level Questions:
- which ti were filtered out?
- “how redundant was TSfull?”
Derivations:
for each ti that has been removed by Filter:
ti was invalidated by Filter
[Figure: provenance graph over TSfull, ti and Filter, with relations used, wasInvalidatedBy]
Provenance for training set optimization / Layer III
Explainability:
- Why was ti selected/removed?
“1. ti belongs to cluster Ch,
2. there exists tj in Ch such that d(ti,tj) < δ,
→ ti and tj redundant,
→ tj selected, ti removed”
Assumptions:
- TSfull = {ti}, TSopt = {ti}
- Filter (embed/cluster/filter) is
provenance-aware
A specific (formal) language is needed to express these explanations
… from which a natural language rendering can then be created
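One way to picture such a language is a structured record per removed item, from which the natural language rendering is derived. The sketch below is an assumption about what a record for the redundancy explanation could look like; the class name, field names and wording are hypothetical and do not correspond to an existing vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyExplanation:
    removed: str     # identifier of ti
    kept: str        # identifier of tj
    cluster: str     # identifier of Ch
    distance: float  # d(ti, tj)
    delta: float     # redundancy threshold

    def holds(self):
        """Check that the recorded facts actually justify the removal."""
        return self.distance < self.delta

    def render(self):
        """Natural language rendering of the formal explanation."""
        return (
            f"{self.removed} was removed: it belongs to cluster {self.cluster}, "
            f"where {self.kept} lies at distance {self.distance:.2f} < {self.delta:.2f}, "
            f"so the two are redundant and {self.kept} was selected."
        )


expl = RedundancyExplanation(removed="t7", kept="t3", cluster="C2", distance=0.04, delta=0.10)
text = expl.render()
```

Keeping the explanation structured means it can be checked (`holds`) and queried in aggregate, with the text produced only at presentation time.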
Representing provenance: Layer III
Need a language to express solution-specific explanations:
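[Figure: Use Case 1 provenance graph over x^n_j, x^i_j, CS and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, plus a “why” annotation]
- “Why was x^n_j cleaned?”
[Figure: Use Case 2 provenance graph over TSfull, ti and Filter, with relations used, wasInvalidatedBy, plus a “why” annotation]
- Why was ti selected/removed?
“1. ti belongs to cluster Ch,
2. there exists tj in Ch such that d(ti,tj) < δ,
→ ti and tj redundant,
→ tj selected, ti removed”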
A possible catalogue of DC operators: DC-Check
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of
Reliable Machine Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
Data and training
Q1 Continuous data curation:
- automatic labelling (semi-supervised)
- merging / fusing data sources
- data forensics (detect inappropriate data subsets)
Q3 Data quality assessment:
- Verify ground truth
- Some samples are easier to learn from than others
Q4 Synthetic data, data augmentation
- improve dataset coverage
- unbiased data generation
- stress-testing data
Q5 Model architecture and hyper parameter search
Q7 Balancing robustness and fairness with performance
Testing and deployment
Q10, Q11 Model evaluation and monitoring:
- Identify model failure scenarios, additional targeted data collection
- Hidden groups and hidden stratification
- Detect training and deployment distribution mismatch issues
Q12 Detect and track data shift:
- Construct new datasets for retraining
- Feedback-driven datasets
- Errors inform the automatic creation, updating of training sets
General pattern: data and models interleaved
[Figure: interleaved pipeline with labels: Data (Curation, Engineering, Synthetic data, Augmentation), Data Selection, Learning-ready data, Competitive Modelling, Evaluation (model performance), Monitoring; phases marked Data interventions, MLDev, MLOps; callout “MLflow?”]
How do we capture these interventions in a principled way?
1. Identify data interventions
2. Track data operations
3. Link to model tracking
MLflow and other model management frameworks
Capturing provenance: Layer I
[Figure: pipeline with labels: Dn, CSi, Di, model training, Mbest, MLOps]
Use case 1: Layer I (coarse): Process-level observer
Typical implementation:
- Pandas / Spark Python pipeline / Dataframe datasets
- CS can be a method call or a code block:
1 - Method call:
Di = CS(Dn)
2 - Code block (Dn → Di):
“Begin CS”
--
--
--
“End CS”
[Figure: in both cases the same provenance graph over Dn, Di and CS, with relations wasDerivedFrom, used, wasGeneratedBy]
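Both variants of the process-level observer can be sketched with standard Python constructs: a decorator for the method-call case and a context manager for the code-block case. This is only an illustration of the idea; the names `observed`, `observed_block` and `PROV_LOG`, and the choice to pass dataset names explicitly, are assumptions rather than an existing API.

```python
import contextlib
import functools

PROV_LOG = []  # coarse (Layer I) provenance as (relation, subject, object) triples


def record(activity, input_name, output_name):
    PROV_LOG.append(("used", activity, input_name))
    PROV_LOG.append(("wasGeneratedBy", output_name, activity))
    PROV_LOG.append(("wasDerivedFrom", output_name, input_name))


def observed(input_name, output_name):
    """Variant 1: observe a method call Di = CS(Dn)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(data):
            result = fn(data)
            record(fn.__name__, input_name, output_name)
            return result
        return wrapper
    return decorator


@contextlib.contextmanager
def observed_block(activity, input_name, output_name):
    """Variant 2: observe a code block delimited by Begin CS / End CS."""
    yield
    # Reached only if the block completed without raising.
    record(activity, input_name, output_name)


@observed("Dn", "Di")
def CS(labels):
    return [abs(label) for label in labels]  # toy cleaning step


Di = CS([1, -1, 1])

with observed_block("CS_block", "Dn", "Di2"):
    Di2 = [abs(label) for label in [1, -1]]
```

At this layer the observer treats datasets and the cleaning step as atomic, so it only needs their names, not their contents.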
Capturing provenance: Layer II
Layer II (data-granular): Interpreter-level observer
- Requires observer at the boundaries of CS, i.e. to tell which x.label have changed
- Observer has access to individual dataframe elements
- But it is unaware of data transformation semantics
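A minimal sketch of such a boundary observer is shown below. It snapshots the data before CS runs and compares labels afterwards, so it can tell which labels changed without knowing why. Rows are plain dictionaries rather than dataframe rows, and the sketch assumes CS preserves row order and count; the name `observe_labels` and the toy cleaning step are hypothetical.

```python
import copy


def observe_labels(cs, rows):
    """Run cs on rows and report which labels changed at its boundaries."""
    before = copy.deepcopy(rows)  # snapshot, in case cs mutates its input
    after = cs(rows)
    changes = [
        {"index": j, "old": b["label"], "new": a["label"]}
        for j, (b, a) in enumerate(zip(before, after))
        if b["label"] != a["label"]
    ]
    return after, changes


def toy_cs(rows):
    """Hypothetical cleaning step: relabel every row with x == 3."""
    return [dict(r, label=1) if r["x"] == 3 else dict(r) for r in rows]


rows = [{"x": 1, "label": 0}, {"x": 2, "label": 1}, {"x": 3, "label": 0}]
cleaned, changes = observe_labels(toy_cs, rows)
```

The observer sees that row 2 changed from 0 to 1, but the reason for the change stays inside CS, which is the gap that Layer III addresses.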
[14] A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing
pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
[15] A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12
(2022), 3614–3617. https://doi.org/10.14778/3554821.3554857
[Figure: provenance graph over x^n_j, x^i_j, CSi and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith]
A possible starting point:
Data Provenance for Data Science (DPDS)
A generic dataframe observer for Pandas
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to application
- some control surfaced
Observable derivations
fillna
1-hot encoding
Imputation and one-hot encoding:
Granular provenance
Derivation of each element of each intermediate dataframe (when possible)
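The kind of element-level derivation meant here can be illustrated on plain Python lists standing in for dataframe columns. The mapping chosen below (an imputed value is derived from all observed values of the column; each one-hot cell is derived from the source cell in the same row) is an assumption made for illustration, not a description of what the observer in [14] records.

```python
def fillna_mean(column):
    """Mean imputation; returns new values and, per element, the source row indices."""
    present = [i for i, v in enumerate(column) if v is not None]
    mean = sum(column[i] for i in present) / len(present)
    values = [mean if v is None else v for v in column]
    sources = [present if v is None else [i] for i, v in enumerate(column)]
    return values, sources


def one_hot(name, column):
    """One-hot encoding; each new cell is derived from the source cell in the same row."""
    encoded, sources = {}, {}
    for category in sorted(set(column)):
        new_col = f"{name}_{category}"
        encoded[new_col] = [int(v == category) for v in column]
        for i in range(len(column)):
            sources[(new_col, i)] = (name, i)
    return encoded, sources


ages, age_sources = fillna_mean([30.0, None, 50.0])
colours, colour_sources = one_hot("colour", ["red", "blue", "red"])
```

The imputed element carries a derivation from rows 0 and 2, while untouched elements derive only from themselves, which is the granularity needed to answer data-level questions.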
Benefits and current limitations
✓ Designed for Layer II provenance, but can be used for Layer I (atomic data, atomic transformation)
✓ Implementation on a Neo4j backend → explainability questions map well to graph queries (Cypher)
✗ Unaware of operator semantics → provenance not always precise
✗ Scalability challenges when provenance is granular and changes pervasive
✗ Requires summarization / compression
Capturing provenance: Layer III
Layer III (data- and process-granular):
Requires explanation generator as part of the transformation logic
Approach: operators send “explanations” to a provenance server using an API at runtime
At a chosen granularity: dataset → data item
[Figure: CS transforms D into D’ and sends {xi, x’i, expli} to Prov-DB; provenance graph over x^n_j, x^i_j, CS and C, with relations wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, plus a “why” annotation]
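A provenance-aware operator along these lines could look as follows. The sketch replaces the provenance server with an in-memory class and the API call with a method call; `ProvDB`, `report`, `why` and the trusted-annotation cleaning rule are all hypothetical names chosen for illustration.

```python
class ProvDB:
    """In-memory stand-in for a provenance server."""

    def __init__(self):
        self.records = []

    def report(self, before, after, explanation):
        """Store one {xi, x'i, expl_i} record sent by an operator at runtime."""
        self.records.append({"before": before, "after": after, "explanation": explanation})

    def why(self, before):
        """Answer: why was this data item changed?"""
        return [r["explanation"] for r in self.records if r["before"] == before]


def clean_labels(items, trusted, prov):
    """Provenance-aware cleaning step: explains each change as it makes it."""
    cleaned = []
    for item_id, label in items:
        fixed = trusted.get(item_id, label)
        if fixed != label:
            prov.report(
                (item_id, label),
                (item_id, fixed),
                f"label {label!r} disagreed with trusted annotation {fixed!r}",
            )
        cleaned.append((item_id, fixed))
    return cleaned


prov = ProvDB()
D = [("x1", "cat"), ("x2", "dog"), ("x3", "cat")]
D_prime = clean_labels(D, trusted={"x2": "cat"}, prov=prov)
reasons = prov.why(("x2", "dog"))
```

Because the explanation is produced inside the operator, it can reflect the operator's own logic, at the cost of having to instrument every operator.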
Summary of goals and action plan
[Figure: action plan with labels: problem instances (Data, Training, Ops), Observe / record, Prov-DB, Reproduce / explain, Enable reuse, Curated Data toolkit]
Goals: to support
• Reusability and emerging best practices for
complex data intervention + usage patterns
• Reproducibility, explainability of pipeline instances
How:
- Enable data processing observations / capture
- Build a curated catalogue of interventions + usage patterns
- Associate provenance with data + model versions
Challenges:
- Observability: Instrumenting common runtime for transparent capture
- Granularity: pick a layer (I-II-III): precision vs scalability → how much do we need?
- “why?” vocabulary and language for expressing explanations
IDEAL
2023

More Related Content

What's hot

Homogeneous ddbms
Homogeneous ddbmsHomogeneous ddbms
Homogeneous ddbmsPooja Dixit
 
information retrieval Techniques and normalization
information retrieval Techniques and normalizationinformation retrieval Techniques and normalization
information retrieval Techniques and normalizationAmeenababs
 
Credit Suisse: Multi-Domain Enterprise Reference Data
Credit Suisse: Multi-Domain Enterprise Reference DataCredit Suisse: Multi-Domain Enterprise Reference Data
Credit Suisse: Multi-Domain Enterprise Reference DataOrchestra Networks
 
DAS Slides: Best Practices in Metadata Management
DAS Slides: Best Practices in Metadata ManagementDAS Slides: Best Practices in Metadata Management
DAS Slides: Best Practices in Metadata ManagementDATAVERSITY
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Data-Ed Online: Data Architecture Requirements
Data-Ed Online: Data Architecture RequirementsData-Ed Online: Data Architecture Requirements
Data-Ed Online: Data Architecture RequirementsDATAVERSITY
 
Whitepaper on Master Data Management
Whitepaper on Master Data Management Whitepaper on Master Data Management
Whitepaper on Master Data Management Jagruti Dwibedi ITIL
 
Information storage and management
Information storage and managementInformation storage and management
Information storage and managementAkash Badone
 
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?Bernard Marr
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrievalBasma Gamal
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceDATAVERSITY
 
Creating a Data Management Plan
Creating a Data Management PlanCreating a Data Management Plan
Creating a Data Management PlanKristin Briney
 
Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3CTIN
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 

What's hot (20)

Homogeneous ddbms
Homogeneous ddbmsHomogeneous ddbms
Homogeneous ddbms
 
Network forensics1
Network forensics1Network forensics1
Network forensics1
 
information retrieval Techniques and normalization
information retrieval Techniques and normalizationinformation retrieval Techniques and normalization
information retrieval Techniques and normalization
 
Credit Suisse: Multi-Domain Enterprise Reference Data
Credit Suisse: Multi-Domain Enterprise Reference DataCredit Suisse: Multi-Domain Enterprise Reference Data
Credit Suisse: Multi-Domain Enterprise Reference Data
 
DAS Slides: Best Practices in Metadata Management
DAS Slides: Best Practices in Metadata ManagementDAS Slides: Best Practices in Metadata Management
DAS Slides: Best Practices in Metadata Management
 
Ado.net
Ado.netAdo.net
Ado.net
 
Cyber forensics and auditing
Cyber forensics and auditingCyber forensics and auditing
Cyber forensics and auditing
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Data-Ed Online: Data Architecture Requirements
Data-Ed Online: Data Architecture RequirementsData-Ed Online: Data Architecture Requirements
Data-Ed Online: Data Architecture Requirements
 
Query expansion
Query expansionQuery expansion
Query expansion
 
Whitepaper on Master Data Management
Whitepaper on Master Data Management Whitepaper on Master Data Management
Whitepaper on Master Data Management
 
Information storage and management
Information storage and managementInformation storage and management
Information storage and management
 
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
 
Introduction to Complex Networks
Introduction to Complex NetworksIntroduction to Complex Networks
Introduction to Complex Networks
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
Data Management
Data Management Data Management
Data Management
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
 
Creating a Data Management Plan
Creating a Data Management PlanCreating a Data Management Plan
Creating a Data Management Plan
 
Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3Windows 7 forensics event logs-dtl-r3
Windows 7 forensics event logs-dtl-r3
 
Data Quality
Data QualityData Quality
Data Quality
 

Similar to Data-centric AI and the convergence of data and model engineering: opportunities to streamline the end-to-end data value chain

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 
deep_Visualization in Data mining.ppt
deep_Visualization in Data mining.pptdeep_Visualization in Data mining.ppt
deep_Visualization in Data mining.pptPerumalPitchandi
 
An Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersAn Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersIJAEMSJORNAL
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
 
A modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringA modified k means algorithm for big data clustering
A modified k means algorithm for big data clusteringSK Ahammad Fahad
 
An Effective Storage Management for University Library using Weighted K-Neare...
An Effective Storage Management for University Library using Weighted K-Neare...An Effective Storage Management for University Library using Weighted K-Neare...
An Effective Storage Management for University Library using Weighted K-Neare...C Sai Kiran
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...IOSRjournaljce
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemIJSRD
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemIJSRD
 
A Deep Dissertion Of Data Science Related Issues And Its Applications
A Deep Dissertion Of Data Science  Related Issues And Its ApplicationsA Deep Dissertion Of Data Science  Related Issues And Its Applications
A Deep Dissertion Of Data Science Related Issues And Its ApplicationsTracy Hill
 
Implementation of Automated Attendance System using Deep Learning
Implementation of Automated Attendance System using Deep LearningImplementation of Automated Attendance System using Deep Learning
Implementation of Automated Attendance System using Deep LearningMd. Mahfujur Rahman
 
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsSurvey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Survey on MapReduce in Big Data Clustering using Machine Learning AlgorithmsIRJET Journal
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelDr. Abdul Ahad Abro
 

Similar to Data-centric AI and the convergence of data and model engineering: opportunities to streamline the end-to-end data value chain (20)

Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 
deep_Visualization in Data mining.ppt
deep_Visualization in Data mining.pptdeep_Visualization in Data mining.ppt
deep_Visualization in Data mining.ppt
 
An Overview of Supervised Machine Learning Paradigms and their Classifiers
Data-centric AI and the convergence of data and model engineering: opportunities to streamline the end-to-end data value chain

  • 1. Data-centric AI and the convergence of data and model engineering: opportunities to streamline the end-to-end data value chain Prof. Paolo Missier School of Computing Newcastle University, UK University of Évora, Portugal Nov 22-24, 2023
  • 2. Health Data Science is all about the data. Kremer, R.; Raza, S. M.; Eto, F.; Casement, J.; Atallah, C.; Finer, S.; Lendrem, D.; Barnes, M.; Reynolds, N. J.; and Missier, P. Tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. In 2022 IEEE International Conference on Big Data (Big Data), pages 4390–4399, December 2022. IDEAL 2023
  • 3. Data-centric AI (DCAI): is that a thing? Sources: towardsdatascience.com, https://datacentricai.org/, https://landing.ai/data-centric-ai/
  • 4. Background: the "landingAI" DCAI pitch
  • 5. A practical view: https://www.vanderschaar-lab.com/data-centric-ai/ [1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
  • 6. Talk outline • Data-centric AI sounds familiar to database people: old wine in new bottles? • What is new: iteration patterns that span the entire data / training / deployment / monitoring lifecycle • A few examples: training set cleaning, pruning • Goal: harvesting this knowledge into reusable patterns and libraries, to support end-to-end data science pipelines in a principled way • How can we do this? Data versioning and provenance in support of reproducibility and explainability; current technical solutions, and open challenges
  • 7. Rapidly emerging literature (figures on this slide are taken from [2] and [3]). [2] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, and Xia Hu. ‘Data-Centric AI: Perspectives and Challenges’. arXiv, 2 April 2023. http://arxiv.org/abs/2301.04819. [3] Jarrahi, Mohammad Hossein, Ali Memariani, and Shion Guha. ‘The Principles of Data-Centric AI’. Commun. ACM 66, no. 8 (August 2023): 84–92. https://doi.org/10.1145/3571724. [4] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. ‘Data-Centric Artificial Intelligence: A Survey’. arXiv, 11 June 2023. https://doi.org/10.48550/arXiv.2303.10158.
  • 8. Hypothesis: DCAI involves extended feedback loops (figure source: [5]). [5] Singh, Prerna. ‘Systematic Review of Data-Centric Approaches in Artificial Intelligence and Machine Learning’. Data Science and Management 6, no. 3 (1 September 2023): 144–57. https://doi.org/10.1016/j.dsm.2023.06.001.
  • 9. The interesting twist: end-to-end patterns. [Diagram labels: Data interventions; Competitive Modelling; Learning-ready data; Evaluation (model performance); Ops / Monitoring; Data Selection.] Why are these interesting? They are human-in-the-loop pipelines; they are challenging to explain and reproduce; they require a combination of skills.
  • 10. A closer look: data cleaning for ML. Hypothesis [6]: data cleaning needs to take an end-to-end, application-driven approach that integrates cleaning throughout the ML application. These data interventions by themselves are not entirely new to Data Quality (DQ) practitioners. However, DQ can be model- and application-specific. [6] Neutatz, Felix, et al. "From Cleaning before ML to Cleaning for ML." IEEE Data Eng. Bull. 44.1 (2021): 24-41.
  • 11. Example: ActiveClean [7]. Problem: full cleaning is not feasible. An initial model is trained on a dirty training set; the consequence of dirty data is that the wrong loss function is optimized, so the best model is a suboptimal point along the “real” loss function. Approach: incremental cleaning combined with SGD. Repeat until the cleaning budget is reached: ActiveClean selects “dirty” items that are expected to move the model down the gradient; the items are cleaned (e.g. manually) and added to the clean set. The approach relies on convexity for convergence. Here “cleaning” means (manually) removing outliers and attribute transformation. [7] Krishnan, Sanjay, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. ‘ActiveClean: Interactive Data Cleaning for Statistical Modeling’. Proceedings of the VLDB Endowment 9, no. 12 (1 August 2016): 948–59. https://doi.org/10.14778/2994509.2994514.
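The loop described on this slide can be illustrated with a deliberately simplified sketch. It is not the ActiveClean estimator from [7]: record selection here is greedy (largest current gradient magnitude) rather than sampled, the model is a one-parameter least-squares fit, and `oracle` is a hypothetical stand-in for the human who supplies the correct value.

```python
# Simplified illustration of budgeted, incremental cleaning interleaved with
# gradient steps. Assumption: greedy selection replaces ActiveClean's sampling.

def fit_least_squares(xs, ys):
    """Closed-form slope for the no-intercept model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def grad(w, x, y):
    """Gradient of the squared loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def clean_incrementally(xs, dirty_ys, oracle, budget, lr=0.1):
    """Clean one record per step, then take one gradient step.

    `oracle(i)` is a hypothetical callback returning the true label of record i.
    """
    ys = list(dirty_ys)
    w = fit_least_squares(xs, ys)      # initial model, trained on dirty data
    cleaned = []                       # indices that have been verified so far
    n = len(xs)
    for _ in range(budget):
        unverified = [i for i in range(n) if i not in cleaned]
        # Greedy choice: the record whose gradient is largest in magnitude.
        pick = max(unverified, key=lambda i: abs(grad(w, xs[i], ys[i])))
        ys[pick] = oracle(pick)        # the (manual) cleaning step
        g_new = grad(w, xs[pick], ys[pick])
        g_old = (sum(grad(w, xs[i], ys[i]) for i in cleaned) / len(cleaned)
                 if cleaned else 0.0)
        # Weight the two gradients by the share of verified vs unverified data.
        g = (len(cleaned) / n) * g_old + (len(unverified) / n) * g_new
        w -= lr * g
        cleaned.append(pick)
    return w, cleaned

xs = [i / 10 for i in range(1, 21)]
true_ys = [2 * x for x in xs]
# Every fourth label was lost and recorded as 0 (the "dirty" records).
dirty_ys = [0.0 if (i + 1) % 4 == 0 else y for i, y in enumerate(true_ys)]

w_dirty = fit_least_squares(xs, dirty_ys)
w_clean, cleaned = clean_incrementally(xs, dirty_ys, lambda i: true_ys[i], budget=3)
```

On this toy data, three cleaning actions move the slope from roughly 1.39 to roughly 1.77, against a true slope of 2, without ever cleaning the whole set.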
  • 12. Some concrete goals: to provide infrastructure support to streamline interesting end-to-end patterns. 1. Reusability and reproducibility: build a curated, reusable library of safe and predictable data+model intervention patterns. 2. Versioning: end-to-end pipeline management, beyond model management (e.g. MLflow). 3. Explainability: why was an intervention needed? How and why was a datapoint affected? Open questions: How can this be achieved? What language is needed to express explanations? At which level of abstraction?
  • 13. Use cases from the DataPerf challenges (https://www.dataperf.org/). Aim: to develop data performance benchmarks for ML, complementing the MLPerf benchmarks; both are part of MLCommons. Benchmarks emerge through challenges, which demonstrate how model performance can be enhanced through data interventions. Why are these interesting? They surface and demonstrate novel data intervention approaches; they provide evidence of new data/training interleaving patterns; they offer opportunities for reusability / generalisability; challenges typically fix the dataset and model, and focus on optimal interventions. [10] Mazumder, Mark, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, et al. ‘DataPerf: Benchmarks for Data-Centric AI Development’. arXiv, 13 October 2023. https://doi.org/10.48550/arXiv.2207.10062.
  • 14. Use case 1: “Training Set Debugging” challenge. Context: a vision dataset for image classification tasks. Given: a training set D_tr of labelled images and a classification task T; images are annotated with image-level labels, object bounding boxes, object segmentation masks… Training data comes from the Open Images V7 dataset. For reference, we first calculate the best model performance P for M(D_tr), trained on a perfectly labelled D_tr and evaluated on an independent test set D_test. “Data-centric” challenge scenario and goals: realistically, some of the labels in D_tr are noisy. Goal: suggest a strategy that needs the minimal number of label fixes (“cleaning” actions) to achieve a target performance gain relative to P.
  • 15. Data cleaning simulation pattern. [Diagram. Evaluator side: D_tr → corrupt labels → D_n; fixed training code; model training; model eval; eval score. Competitor side: cleaning priority strategy → clean → D’ → model training → M’.] A noisy version D_n is generated from D_tr (e.g. by label flipping). The target performance is recorded by training on D_tr and testing on D_test. Model performance on D_n will be degraded relative to the top performance P achieved by M on D_tr. A strategy must suggest a ranking of the examples in D_n such that, by "cleaning" them in that order, performance approximates the top performance. Strategies are scored on the number of cleaning actions required to achieve 95% of the target performance.
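The scoring rule above can be sketched on a toy problem. Everything here is an assumption made for illustration and not the challenge's code: the 1-nearest-neighbour "model", the synthetic one-dimensional data, and the neighbour-disagreement ranking strategy.

```python
# Toy version of the evaluator loop: clean labels in the order proposed by a
# strategy and count the actions needed to reach 95% of the target accuracy.

def one_nn_predict(train_x, train_y, x):
    """Label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(one_nn_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def actions_to_target(train_x, noisy_y, clean_y, test_x, test_y,
                      ranking, fraction=0.95):
    """Number of cleaning actions before accuracy reaches fraction * target."""
    target = accuracy(train_x, clean_y, test_x, test_y)  # performance P
    labels = list(noisy_y)
    for n_actions, idx in enumerate(ranking, start=1):
        labels[idx] = clean_y[idx]       # one "cleaning" action
        if accuracy(train_x, labels, test_x, test_y) >= fraction * target - 1e-12:
            return n_actions
    return None                          # target never reached

def neighbour_disagreement_ranking(train_x, noisy_y, k=2):
    """Rank examples by how many of their k nearest neighbours disagree."""
    def score(i):
        others = sorted((j for j in range(len(train_x)) if j != i),
                        key=lambda j: (abs(train_x[j] - train_x[i]), j))
        return sum(noisy_y[j] != noisy_y[i] for j in others[:k])
    return sorted(range(len(train_x)), key=lambda i: -score(i))

train_x = list(range(20))
clean_y = [0 if x < 10 else 1 for x in train_x]
flipped = {1, 4, 7, 12, 15, 18}                      # simulated label noise
noisy_y = [1 - y if i in flipped else y for i, y in enumerate(clean_y)]
test_x = [x + 0.25 for x in train_x]
test_y = clean_y

smart = actions_to_target(train_x, noisy_y, clean_y, test_x, test_y,
                          neighbour_disagreement_ranking(train_x, noisy_y))
naive = actions_to_target(train_x, noisy_y, clean_y, test_x, test_y,
                          list(range(20)))
```

A lower count is a better score: here the disagreement-based ranking reaches the threshold in fewer actions than cleaning in index order.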
  • 16. What can be learnt from this exercise? [Diagram: D_n → cleaning strategy → D’ → model training → M’ → eval; M_best → MLOps.] The challenge is effectively a simulation of a two-level iterative process. Challenge winners will have developed and demonstrated new strategies for training set debugging. However, a strategy may be optimized for the dataset D_n, the task T, and the pre-selected model.
  • 17. Provenance and versioning. [Diagram: D_n → CS_i → D_i → model training → M’ → eval; M_best → MLOps.] We would like to: 1. Document that D_i was derived from D_n using CS_i, as part of a longer pipeline. 2. Be able to identify what effect CS_i had on D_n: (a) which data labels were cleaned, (b) why they were cleaned. 3. Make sure CS_i can be reused safely: (a) specify assumptions and pre-requisites, (b) provide examples of past usages.
  • 18. Provenance layers I: coarse provenance. Assumptions: D_n and D_i are atomic units of data; CS is an atomic unit of processing. Reproducibility, “outer layer” questions: Where does D_i come from? Which version D_i was used to train M_best? Derivation: D_i was derived from D_n using CS_i; M_best was trained on D_i. Attribution: CS_i was created by <creator C>. [Diagram: entities D_n and D_i, activity CS_i, agent C; edges used, wasGeneratedBy, wasDerivedFrom, wasAssociatedWith.]
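The coarse-grained derivations on this slide can be recorded as typed edges and queried for lineage. The sketch below is a minimal in-memory illustration, not a provenance library; the activity name `train_run` is a hypothetical identifier added so that "M_best was trained on D_i" can be expressed with `used` and `wasGeneratedBy`.

```python
from collections import defaultdict

# (relation, subject, object): each edge points from the later thing to the
# earlier thing it depends on, as in the PROV relations named on the slide.
edges = [
    ("wasDerivedFrom", "D_i", "D_n"),
    ("used", "CS_i", "D_n"),
    ("wasGeneratedBy", "D_i", "CS_i"),
    ("wasAssociatedWith", "CS_i", "C"),
    ("used", "train_run", "D_i"),             # hypothetical training activity
    ("wasGeneratedBy", "M_best", "train_run"),
]

def upstream(node, edges):
    """Everything `node` depends on, directly or transitively."""
    graph = defaultdict(set)
    for _, src, dst in edges:
        graph[src].add(dst)
    seen, stack = set(), [node]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

lineage = upstream("M_best", edges)
```

Asking for the lineage of `M_best` answers the outer-layer questions: it returns the training run, the dataset version `D_i`, the noisy source `D_n`, the cleaning strategy `CS_i` and its creator `C`.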
  • 19. Provenance layers II: data-granular provenance. Assumptions: D_n = {x^n_j}, D_i = {x^i_j}; CS is an atomic unit of processing. Explainability, data-level questions: Which x^n_j were cleaned? “How dirty was D_n?” In aggregate: how many labels were cleaned to achieve a target performance? Derivations: for each x^i_j that has been cleaned by CS_i, x^i_j was derived from x^n_j. [Diagram: entities x^n_j and x^i_j, activity CS_i, agent C; edges used, wasGeneratedBy, wasDerivedFrom, wasAssociatedWith.]
  • 20. Provenance layers III: data- and process-granular provenance. Assumptions: D_n = {x^n_j}, D_i = {x^i_j}; CS is provenance-aware, i.e. able to attach a local explanation to each data derivation. Explainability: CS explains its own actions (“Why was x^n_j cleaned?”). Derivations: same as in the previous layer. Local explanations are algorithm-specific; some standardization is possible, e.g. human-supplied labels (Active Learning). [Diagram: as for layer II, with an additional "why" annotation on the derivation.]
  • 21. Representing provenance: a formal, interoperable data model and syntax for generic provenance constructs. It accommodates layers I and II, and is extensible to a domain vocabulary (e.g. DC-Check).
  • 22. Example: layers I, II. [Diagram: entities x^n_j and x^i_j, activity CS, agent C; edges used, wasGeneratedBy, wasDerivedFrom, wasAssociatedWith, why.] Surface representation: PROV-N, a relational-like notation (à la Datalog): entity(D_noisy, [prov:type="training-set"]) entity(D_clean, [prov:type="training-set"]) entity(e1, [prov:type="training-data", inSet="D_noisy", index=j, val=V]) entity(e2, [prov:type="training-data", inSet="D_clean", index=j, val=W]) entity(C, [prov:type="prov:agent", prov:type="CS-creator"]) activity(CS, [prov:type="cleaning-strategy", version="v1.0", desc="…"]) wasDerivedFrom(e2, e1) used(CS, e1) wasGeneratedBy(e2, CS) wasAssociatedWith(CS, C). Internal representation: property-value graphs! Hint: Neo4j works well…
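Statements like those above can be generated mechanically by comparing the noisy and cleaned label lists. The sketch below emits PROV-N-style text for every item whose label changed; the `ex:` prefix and the identifier scheme (`ex:noisy_j`, `ex:clean_j`) are assumptions made for illustration, and the output is not validated against a PROV-N parser.

```python
def cleaning_provenance(noisy, clean, strategy="ex:CS", creator="ex:C"):
    """Return (statements, changed_indices) for one cleaning run.

    `noisy` and `clean` are label lists of equal length; only items whose
    label differs get item-level (layer II) provenance.
    """
    stmts = [
        f'activity({strategy}, [prov:type="cleaning-strategy"])',
        f'agent({creator}, [prov:type="CS-creator"])',
        f'wasAssociatedWith({strategy}, {creator})',
    ]
    changed = []
    for j, (v, w) in enumerate(zip(noisy, clean)):
        if v == w:
            continue                      # untouched items carry no derivation
        e1, e2 = f"ex:noisy_{j}", f"ex:clean_{j}"
        stmts += [
            f'entity({e1}, [prov:type="training-data", ex:inSet="D_noisy", '
            f'ex:index={j}, ex:val="{v}"])',
            f'entity({e2}, [prov:type="training-data", ex:inSet="D_clean", '
            f'ex:index={j}, ex:val="{w}"])',
            f'wasDerivedFrom({e2}, {e1})',
            f'used({strategy}, {e1})',
            f'wasGeneratedBy({e2}, {strategy})',
        ]
        changed.append(j)
    return stmts, changed

stmts, changed = cleaning_provenance(["cat", "dog", "cat"], ["cat", "cat", "cat"])
```

The length of `changed` is one possible aggregate answer to "how dirty was the noisy set?", and each `wasDerivedFrom` statement identifies a cleaned item.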
  • 23. Use case 2: training set optimisation
    Motivation: training efficiency
    → model performance (test loss) correlates with training data size D according to a power law [11]
    However, "Since scalings with N (model size), D (training tokens), Cmin (compute budget) are power-laws, there are diminishing returns with increasing scale." [11]
    This motivates trying to optimize D:
    1. Redundancy in D leads to wasted training time
    2. Not all training examples are equally important for training → which ones should be kept / removed?
    [11] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
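To make "diminishing returns" concrete, the data-size power law can be written out. The form below and the approximate exponent are recalled from [11] rather than taken from the slide, so treat the symbols D_c and α_D and the value α_D ≈ 0.095 as assumptions to check against the paper.

```latex
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
\frac{L(2D)}{L(D)} = 2^{-\alpha_D} \approx 0.94
\quad \text{for } \alpha_D \approx 0.095
```

Under these assumptions, doubling the training set reduces the test loss by only about 6%, which is why removing redundant or uninformative examples becomes attractive.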
  • 24. Training set optimization. Task 1: reducing redundancy
    Approach [12]:
    1. Map the training set D to an embedding space, using pre-trained foundation models
    2. Cluster all data points in the embedding space using k-means
    3. Using cosine similarity, identify similar points within each cluster; apply a threshold and select which points to keep
    [12] Abbas, Amro, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. 'SemDeDup: Data-Efficient Learning at Web-Scale through Semantic Deduplication'. arXiv, 22 March 2023. http://arxiv.org/abs/2303.09540.
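Step 3 of this approach can be sketched in a few lines. The sketch below assumes that embeddings and k-means cluster assignments have already been computed (steps 1 and 2), and it uses a simple greedy rule that keeps the first item seen in each group of near-duplicates. That rule, the function names, and the default threshold are illustrative assumptions; the selection rule in [12] may differ.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup_within_clusters(embeddings, cluster_ids, threshold=0.95):
    """Return (kept, removed).

    kept    : sorted list of indices that survive deduplication
    removed : dict mapping each dropped index to the kept index it duplicates
    Comparisons are only made between points in the same cluster.
    """
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    kept, removed = [], {}
    for members in by_cluster.values():
        cluster_kept = []
        for i in members:
            dup_of = next(
                (k for k in cluster_kept
                 if cosine(embeddings[i], embeddings[k]) >= threshold),
                None,
            )
            if dup_of is None:
                cluster_kept.append(i)
            else:
                removed[i] = dup_of
        kept.extend(cluster_kept)
    return sorted(kept), removed
```

Returning the `removed` mapping, and not just the surviving indices, preserves the information needed later to say which retained example made each dropped example redundant.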
  • 25. Training set optimization. Task 2: pruning easy/hard examples
    Main findings from [13]:
    1. Not all training examples are created equal (hard vs easy)
    2. The best pruning strategy depends on the amount of initial data
       - Small TS → keep the easy examples
       - Large TS → keep the hard examples
    A really simple pruning method, very similar to Task 1:
    "To compute a self-supervised pruning metric for ImageNet, we perform k-means clustering in the embedding space of an ImageNet pre-trained self-supervised model and define the difficulty of each data point by the Euclidean distance to its nearest cluster centroid, or prototype" [13]
    Caveat: only tested on ImageNet!
    [Figure reproduced from [13]]
    [13] Sorscher, Ben, et al. "Beyond neural scaling laws: beating power law scaling via data pruning." Advances in Neural Information Processing Systems 35 (2022): 19523-19536.
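The pruning metric quoted above is easy to sketch once centroids are available. The code below assumes the embeddings and k-means centroids are already computed; the function names and the `keep_fraction` and `keep` parameters are hypothetical, chosen to mirror the "keep easy / keep hard" finding rather than to reproduce the code of [13].

```python
import math


def difficulty_scores(embeddings, centroids):
    """Difficulty of each point = Euclidean distance to its nearest centroid."""
    return [min(math.dist(x, c) for c in centroids) for x in embeddings]


def prune(embeddings, centroids, keep_fraction, keep="hard"):
    """Return the sorted indices of the examples to keep.

    keep="hard" retains the points farthest from any centroid,
    keep="easy" retains the points closest to a centroid.
    """
    scores = difficulty_scores(embeddings, centroids)
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=(keep == "hard"))
    n_keep = max(1, round(keep_fraction * len(order)))
    return sorted(order[:n_keep])
```

With a small training set one would call `prune(..., keep="easy")`, and with a large one `prune(..., keep="hard")`, following the second finding on the slide.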
  • 26. Provenance for training set optimization
    This is a classic filter pipeline, only a little more sophisticated:
    - Black box: TSfull → Filter → TSopt (Filter used TSfull; TSopt wasGeneratedBy Filter; TSopt wasDerivedFrom TSfull)
    - Gray box: TSfull → Embed → TSemb → Cluster → TSclus → Select → TSopt (each step used its input; each output wasGeneratedBy (wgby) its step; TSopt wasDerivedFrom TSfull)
    Layers I and II are very similar to Use Case 1.
    Reproducibility:
    - Where does TSopt come from? → black / gray box options
  • 27. Provenance for training set optimization / Layer II
    Assumptions:
    - TSfull = {ti}, TSopt = {ti}
    - Filter is an atomic unit of processing
    Explainability (data-level questions):
    - Which ti were filtered out?
    - "How redundant was TSfull?"
    Derivations: for each ti that has been removed by Filter, ti was invalidated by Filter
    [Diagram: Filter used TSfull; each removed ti wasInvalidatedBy Filter]
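Because Filter is treated as atomic here, Layer II provenance can be derived purely from its input and output. A minimal sketch, assuming items carry stable identifiers; the function name `filter_provenance` and the definition of redundancy as the removed fraction are illustrative assumptions.

```python
def filter_provenance(ts_full_ids, ts_opt_ids, activity="Filter"):
    """Derive Layer II statements by comparing the filter's input and output.

    Returns (statements, redundancy), where redundancy is the fraction of
    TSfull items that were removed.
    """
    kept = set(ts_opt_ids)
    removed = [t for t in ts_full_ids if t not in kept]
    statements = [f"used({activity}, TSfull)"]
    statements += [f"wasInvalidatedBy({t}, {activity})" for t in removed]
    redundancy = len(removed) / len(ts_full_ids) if ts_full_ids else 0.0
    return statements, redundancy
```

This answers "which ti were filtered out?" and "how redundant was TSfull?", but says nothing about why an item was removed.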
  • 28. Provenance for training set optimization / Layer III
    Assumptions:
    - TSfull = {ti}, TSopt = {ti}
    - Filter (embed/cluster/filter) is provenance-aware
    Explainability:
    - Why was ti selected/removed?
      "1. ti belongs to cluster Ch; 2. there exists tj in Ch such that d(ti,tj) < δ → ti and tj are redundant → tj selected, ti removed"
    A specific (formal) language is needed to express these explanations…
    …from which a natural language rendering can then be created
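One way to picture such a language is as a small structured record from which text is rendered. The sketch below is an assumption about what a record for the redundancy explanation could look like; the class name `RedundancyExplanation` and its fields are hypothetical and are not a proposed standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyExplanation:
    """Structured form of: ti in Ch, exists tj in Ch with d(ti,tj) < delta."""
    removed: str     # identifier of ti
    kept: str        # identifier of tj
    cluster: str     # identifier of Ch
    distance: float  # d(ti, tj)
    delta: float     # redundancy threshold

    def holds(self):
        """Check that the recorded facts actually justify the removal."""
        return self.distance < self.delta

    def render(self):
        """Natural-language rendering generated from the formal record."""
        return (
            f"{self.removed} belongs to cluster {self.cluster}; "
            f"{self.kept} is in the same cluster with "
            f"d({self.removed},{self.kept}) = {self.distance} < {self.delta}; "
            f"so {self.kept} was selected and {self.removed} removed."
        )
```

Keeping the formal record as the primary artefact means explanations can be queried and validated, with the sentence produced only when a human needs to read it.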
  • 29. Representing provenance: Layer III
    Need a language to express solution-specific explanations:
    - "Why was xnj cleaned?"
    - Why was ti selected/removed?
      "1. ti belongs to cluster Ch; 2. there exists tj in Ch such that d(ti,tj) < δ → ti and tj are redundant → tj selected, ti removed"
    [Diagram, use case 1: CS used xnj; xij wasGeneratedBy CS; xij wasDerivedFrom xnj, annotated with "why"; CS wasAssociatedWith C]
    [Diagram, use case 2: Filter used TSfull; ti wasInvalidatedBy Filter, annotated with "why"]
  • 30. A possible catalogue of DC operators: DC-Check
    [1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. 'DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine Learning Systems'. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
  • 31. Data and training
    Q1 Continuous data curation:
    - automatic labelling (semi-supervised)
    - merging / fusing data sources
    - data forensics (detect inappropriate data subsets)
    Q3 Data quality assessment:
    - Verify ground truth
    - Some samples are easier to learn from than others
    Q4 Synthetic data, data augmentation:
    - improve dataset coverage
    - unbiased data generation
    - stress-testing data
    Q5 Model architecture and hyperparameter search
    Q7 Balancing robustness and fairness with performance
  • 32. Testing and deployment
    Q10, Q11 Model evaluation and monitoring:
    - Identify model failure scenarios, additional targeted data collection
    - Hidden groups and hidden stratification
    - Detect training and deployment distribution mismatch issues
    Q12 Detect and track data shift:
    - Construct new datasets for retraining
    - Feedback-driven datasets
    - Errors inform the automatic creation and updating of training sets
  • 33. General pattern: data and models interleaved
    [Diagram labels: Data; Curation; Engineering; Synthetic data; Augmentation; Data Selection; Learning-ready data; Competitive Modelling; Evaluation (model performance); Monitoring; Data interventions; MLDev; MLOps; MLflow?]
    How do we capture these interventions in a principled way?
    1. Identify data interventions
    2. Track data operations
    3. Link to model tracking (MLflow and other model management frameworks)
  • 34. Capturing provenance: Layer I
    Use case 1, Layer I (coarse): process-level observer
    Typical implementation:
    - Pandas / Spark Python pipeline, with datasets held as dataframes
    - CS can be a method call or a code block:
      1. Method call: Di = CS(Dn)
      2. Code block: Dn → Di, delimited by "Begin CS" … "End CS"
    [Diagram: Dn → CSi → Di → Model training → Mbest (MLOps); CS used Dn; Di wasGeneratedBy CS; Di wasDerivedFrom Dn]
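Both forms of CS on this slide can be observed at the process level without looking inside the data. A minimal sketch, assuming an in-memory log in place of a real provenance store; `PROV_LOG`, `record`, `observed_call`, `observed_block`, and the toy `clean` function are all hypothetical names introduced for illustration.

```python
import functools
from contextlib import contextmanager

PROV_LOG = []  # hypothetical in-memory stand-in for a provenance store


def record(activity, used, generated):
    """Append the three coarse (Layer I) statements for one execution of CS."""
    PROV_LOG.extend([
        f"used({activity}, {used})",
        f"wasGeneratedBy({generated}, {activity})",
        f"wasDerivedFrom({generated}, {used})",
    ])


def observed_call(activity, used, generated):
    """Method-call form: wraps a function so that Di = CS(Dn) is recorded."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            record(activity, used, generated)
            return result
        return wrapper
    return decorator


@contextmanager
def observed_block(activity, used, generated):
    """Code-block form: the with-statement marks "Begin CS" and "End CS"."""
    yield
    record(activity, used, generated)  # skipped if the block raises


@observed_call("CS1", used="Dn", generated="D1")
def clean(rows):
    """Toy cleaning strategy: drop missing rows."""
    return [r for r in rows if r is not None]
```

Dataset names are passed explicitly here because, at this layer, datasets are atomic: the observer records that D1 came from Dn through CS1 and nothing about individual rows.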
  • 35. Capturing provenance: Layer II
    Layer II (data-granular): interpreter-level observer
    - Requires an observer at the boundaries of CS, i.e. to tell which x.label have changed
    - The observer has access to individual dataframe elements
    - But it is unaware of data transformation semantics
    A possible starting point: Data Provenance for Data Science (DPDS) [14, 15]
    [Diagram: CSi used xnj; xij wasGeneratedBy CSi; xij wasDerivedFrom xnj; CSi wasAssociatedWith C]
    [14] A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
    [15] A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12 (2022), 3614–3617. https://doi.org/10.14778/3554821.3554857
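The idea of an observer at the boundaries of CS can be illustrated by diffing the data before and after the call. This sketch is not DPDS: it uses plain lists of dicts as a stand-in for dataframe rows, and the function name and the `id` / `label` field names are assumptions.

```python
def observe_label_changes(before, after, key="id", field="label"):
    """Compare rows before and after CS; return (id, old, new) for each change.

    The observer sees individual elements but has no knowledge of why the
    transformation changed them.
    """
    old = {row[key]: row[field] for row in before}
    changes = []
    for row in after:
        if row[key] in old and old[row[key]] != row[field]:
            changes.append((row[key], old[row[key]], row[field]))
    return changes


def derivations(changes, src="D_noisy", dst="D_clean"):
    """Turn observed changes into element-level wasDerivedFrom statements."""
    return [f"wasDerivedFrom({dst}[{i}], {src}[{i}])" for i, _, _ in changes]
```

The count of returned changes gives the aggregate answer to "how dirty was Dn?", while the individual tuples answer "which xnj were cleaned?".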
  • 36. A generic dataframe observer for Pandas
    Approach:
    - add an observer to monitor dataframe changes
    - mostly transparent to the application
    - some control surfaced
  • 37. Observable derivations
    Imputation (fillna) and one-hot encoding: granular provenance
    Derivation of each element of each intermediate dataframe (when possible)
  • 38. Benefits and current limitations
    ✓ Designed for Layer II provenance, but can be used for Layer I (atomic data, atomic transformation)
    ✓ Implementation on a Neo4j backend → explainability questions map well to graph queries (Cypher)
    ✗ Unaware of operator semantics → provenance not always precise
    ✗ Scalability challenges when provenance is granular and changes are pervasive
    ✗ Requires summarization / compression
  • 39. Capturing provenance: Layer III
    Layer III (data- and process-granular): requires an explanation generator as part of the transformation logic
    Approach: operators send "explanations" to a provenance server at runtime, using an API
    At a chosen granularity: dataset → data item
    [Diagram: CS transforms D into D' and sends {xi, x'i, expli} to Prov-DB; CS used xnj; xij wasGeneratedBy CS; xij wasDerivedFrom xnj, annotated with "why"; CS wasAssociatedWith C]
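In this layer the operator itself reports each change together with its reason. The sketch below assumes an in-memory class in place of a real provenance server and a toy cleaning rule; `ProvDB`, its `send` method, and `relabel_low_confidence` are hypothetical names, and the rule is invented purely to have something to explain.

```python
class ProvDB:
    """Hypothetical in-memory stand-in for a provenance server."""

    def __init__(self):
        self.records = []

    def send(self, before, after, explanation):
        """Store one {x, x', expl} triple, as in the slide's diagram."""
        self.records.append({"x": before, "x_prime": after, "expl": explanation})


def relabel_low_confidence(rows, prov, min_conf=0.5, fallback="unknown"):
    """Toy provenance-aware operator: relabel rows whose confidence is too low.

    Every change is reported to `prov` at the moment it is made, together
    with the operator's own reason for making it.
    """
    out = []
    for row in rows:
        if row["conf"] < min_conf and row["label"] != fallback:
            new_row = {**row, "label": fallback}
            prov.send(row, new_row, f"confidence {row['conf']} below {min_conf}")
            out.append(new_row)
        else:
            out.append(row)
    return out
```

Reporting per data item is the finest granularity; an operator could instead send a single record for the whole dataset when item-level detail is too costly.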
  • 40. Summary of goals and action plan
    Goals: to support
    - Reusability and emerging best practices for complex data intervention + usage patterns
    - Reproducibility, explainability of pipeline instances
    How:
    - Enable data processing observations / capture
    - Build a curated catalogue of interventions + usage patterns
    - Associate provenance with data + model versions
    Challenges:
    - Observability: instrumenting a common runtime for transparent capture
    - Granularity: pick a layer (I-II-III): precision vs scalability → how much do we need?
    - "Why?": vocabulary and language for expressing explanations
    [Diagram labels: problem instances; Prov-DB; Data; Training; Ops; Enable reuse; Observe / record; Reproduce / explain; Curated Data toolkit]

Editor's Notes

  1. This brings us to the notion of provenance and versioning, which we can break down into three levels.