Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
∗Department of Engineering, Roma Tre University, Italy
†School of Computer Science, University of Birmingham, UK
‡Newcastle University, School of Computing
DATAPLAT@ICDE
May 2024
Utrecht, NL
Design and Development of a Provenance
Capture Platform for Data Science
Setting and questions
[Figure: Source datasets → Data processing → Training datasets → Training → M → Inference/generation → Model outputs]
Data explanation questions:
• Which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling?
• Which individual data items were affected by each transformation?
• What was the effect?
Provenance basics
Abstract data transformation operator: 𝐷 → (OP) → 𝐷ʹ
[Figure: provenance graph over the entities D and D′ and the activity A, with edges used, wasGeneratedBy and wasDerivedFrom]
Provenance expression:
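As a minimal sketch, the three relations named on this slide can be written as plain (subject, relation, object) triples. The identifiers `D`, `D_prime` and `A` and the helper `provenance_of` are illustrative assumptions rather than part of any PROV library; the edge directions follow the W3C PROV data model.

```python
# Entities D and D' and activity A, as in the abstract operator D -> (OP) -> D'.
def provenance_of(input_id, activity_id, output_id):
    """Return the three PROV relations for one operator application."""
    return [
        (activity_id, "used", input_id),             # A used D
        (output_id, "wasGeneratedBy", activity_id),  # D' wasGeneratedBy A
        (output_id, "wasDerivedFrom", input_id),     # D' wasDerivedFrom D
    ]


triples = provenance_of("D", "A", "D_prime")
for subject, relation, obj in triples:
    print(f"{relation}({subject}, {obj})")
```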
Extension to DAG topologies
Example: inputs $D_0^a$, $D_0^b$, $D_0^c$ are processed independently and eventually merged into $D_n$:
[Figure: dataflow with operators OP1 to OP4 over the datasets $D_0^a$, $D_1^a$, $D_0^b$, $D_1^b$, $D_0^c$, $D^{bc}$ and $D^{abc}$, shown alongside the corresponding provenance graph with used and wasGeneratedBy (wgby) edges]
The Big Provenance Dogma
Data provenance is an enabler for:
• Transparency
• Explainability
• Reproducibility
…for a variety of underlying process and source / target data combinations
[Figure: the pipeline Data processing → Training datasets → Training → M → Inference/generation → Model outputs, with Source, Process and Target roles marked]
Contributions
✓ Analysis of over 500 Data Science pipelines
  - "in the wild" → Kaggle
  - "controlled" → ML Bazaar
✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
✓ Data Provenance for Data Science (DPDS)
  - automatically tracks granular provenance from Pandas
  - maximally transparent and minimally intrusive to the programmer
✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
Data processing pipelines analysis: ML Bazaar
✓ Facilitates developing ML and AutoML systems
✓ Workflow style: pipelines composed out of pre-defined primitives
✓ Data + task pairs with benchmark results over multiple data types
✗ Only 5 types of operators
✗ Single location, controlled ecosystem
DFS = Deep Feature Synthesis
Data processing pipelines analysis: Kaggle
Scope: the 200 most upvoted Python notebooks related to machine learning on Kaggle
• 29 unique pre-processing operations
• 12 appear in fewer than 10 pipelines, for example:
  - transposing
  - changing index values
• feature augmentation (58)
• scaling operations (38)
Data processing operators
Data reduction
Projection: $D' = \pi_C(D)$
Selection: $D' = \sigma_C(D)$
Example: $\pi_{\{Cid, Gender, Age\}}(\sigma_{Age<30}(D))$
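The projection and selection operators above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a dataset is modelled as a list of row dictionaries rather than a Pandas dataframe, and the sample rows and the helper names `select` and `project` are invented for the example.

```python
# A dataset is a list of rows; each row maps an attribute name to a value.
D = [
    {"Cid": 1, "Gender": "F", "Age": 25, "Zip": "NE1"},
    {"Cid": 2, "Gender": "M", "Age": 41, "Zip": "NE2"},
    {"Cid": 3, "Gender": "F", "Age": 29, "Zip": None},
]


def select(rows, condition):
    """Selection: keep the rows that satisfy the condition."""
    return [row for row in rows if condition(row)]


def project(rows, attributes):
    """Projection: keep only the named attributes of every row."""
    return [{a: row[a] for a in attributes} for row in rows]


# Projection on {Cid, Gender, Age} of the selection Age < 30.
result = project(select(D, lambda r: r["Age"] < 30), ["Cid", "Gender", "Age"])
```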
Data augmentation
Vertical augmentation: $\alpha^{\rightarrow}_{f(X):Y}$
Example: $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
Horizontal augmentation: $\alpha^{\downarrow}_{X:f(Y)}(D)$
Example: $E_2 = \alpha^{\downarrow}_{Gender:f_2(Age)}(D)$ (group by gender, avg(age))
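A rough sketch of the two augmentation examples, again on a list of row dictionaries. The bucketing function `age_range` is a hypothetical stand-in for f1, and `grouped_aggregate` only computes the new rows of the group-by example; whether those rows are appended to D or kept as a separate dataset is not stated on the slide, so the sketch leaves them separate.

```python
from collections import defaultdict

D = [
    {"Cid": 1, "Gender": "F", "Age": 25},
    {"Cid": 2, "Gender": "M", "Age": 41},
    {"Cid": 3, "Gender": "F", "Age": 35},
]


def age_range(age):
    # Hypothetical f1: bucket an age into a decade label.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def vertical_augmentation(rows, f, x, y):
    """Add a new attribute y = f(row[x]) to every row (new column)."""
    return [{**row, y: f(row[x])} for row in rows]


def grouped_aggregate(rows, x, f, y):
    """Build one new row per value of x, holding f over the group's y values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[x]].append(row[y])
    return [{x: key, y: f(values)} for key, values in groups.items()]


with_range = vertical_augmentation(D, age_range, "Age", "ageRange")
E2 = grouped_aggregate(D, "Gender", lambda v: sum(v) / len(v), "Age")
```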
Data transformation
$\tau_{f(X)}$
The transformation of a set of features X of D using a function f is obtained by substituting each value $d_{ia}$ with $f(d_{*a})$, for each feature a occurring in X.
Example: data imputation. Here f replaces nulls in column Zip with the most frequent value: $\tau_{f(Zip)}(D)$
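The imputation example can be sketched as follows, assuming a list-of-dictionaries dataset with `None` standing for null. The function name `impute_most_frequent` and the sample Zip values are illustrative, and the sketch assumes the column holds at least one non-null value.

```python
from collections import Counter


def impute_most_frequent(rows, column):
    """Replace nulls (None) in one column with its most frequent non-null value."""
    observed = [row[column] for row in rows if row[column] is not None]
    # Assumes at least one non-null value exists in the column.
    most_frequent = Counter(observed).most_common(1)[0][0]
    return [
        {**row, column: most_frequent if row[column] is None else row[column]}
        for row in rows
    ]


D = [
    {"Cid": 1, "Zip": "NE1"},
    {"Cid": 2, "Zip": None},
    {"Cid": 3, "Zip": "NE1"},
    {"Cid": 4, "Zip": "NE4"},
]
D_prime = impute_most_frequent(D, "Zip")
```

Note that f here depends on the whole column, not only on the value being replaced, which is why the replacement is computed once before the rows are rewritten.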
Data fusion: join and append
Join: $D^L \bowtie^{t}_{C} D^R$
Append: $D^L \uplus D^R$
Example: $D^L \bowtie^{inner}_{D^L.CId = D^R.CId} D^R$
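A small sketch of the inner-join example and of append, where each output row is paired with the input rows it comes from. The pairing format and the sample data are assumptions made for illustration; they are not the representation used by DPDS.

```python
def inner_join(left, right, key):
    """Return (joined_row, (left_index, right_index)) for rows with equal key."""
    return [
        ({**l, **r}, (i, j))
        for i, l in enumerate(left)
        for j, r in enumerate(right)
        if l[key] == r[key]
    ]


def append(left, right):
    """Return (row, (side, index)) pairs: all left rows, then all right rows."""
    return ([(row, ("L", i)) for i, row in enumerate(left)]
            + [(row, ("R", j)) for j, row in enumerate(right)])


D_L = [{"CId": 1, "Gender": "F"}, {"CId": 2, "Gender": "M"}]
D_R = [{"CId": 2, "Zip": "NE2"}, {"CId": 3, "Zip": "NE3"}]

joined = inner_join(D_L, D_R, "CId")
appended = append(D_L, D_R)
```

Each joined row derives from exactly one row on each side, while each appended row derives from a single row of one input.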
Conceptual provenance capture model: templates
$\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F', J, V'}
+
Binding rules
For $i : 1 \dots n$:
used ent.: $[\langle F = X_m, I = i, V = D_{i,X_m} \rangle \mid X_m \in X]$
generated ent.: $[\langle F' = Y_h, J = i, v = f(D_{i,X}) \rangle \mid Y_h \in Y]$
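The binding rules can be read as a loop over rows that emits one used entity per input feature and one generated entity per output feature. The sketch below is an assumption-laden illustration: `bind_entities` is a hypothetical name, the generated value is read from the output rows (where it already equals f applied to the input), and the dictionaries merely mirror the F, I, V and F', J, V' variables of the template.

```python
def bind_entities(d_in, d_out, X, Y):
    """Bind template variables for an operator that maps row i to row i.

    d_in, d_out: lists of row dictionaries of equal length n.
    X: input features read by the operator; Y: output features it writes.
    """
    bindings = []
    for i in range(1, len(d_in) + 1):  # rows are numbered 1..n as on the slide
        row_in, row_out = d_in[i - 1], d_out[i - 1]
        used = [{"F": x, "I": i, "V": row_in[x]} for x in X]
        generated = [{"F'": y, "J": i, "V'": row_out[y]} for y in Y]
        bindings.append((used, generated))
    return bindings


d_in = [{"Age": 25}, {"Age": 41}]
d_out = [{"Age": 25, "ageRange": "20-29"}, {"Age": 41, "ageRange": "40-49"}]
bindings = bind_entities(d_in, d_out, X=["Age"], Y=["ageRange"])
```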
Implementation by shape and value diff
[Figure: decision trees for template selection.
Shape changes: rows added? rows removed? columns added? columns removed? Templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation (composite), data transformation.
Value changes for each column: nulls reduced? values changed? Templates: data transformation (imputation), data transformation, 1-1 derivations.]
For each input/output pair Din, Dout of dataframes:
1. Diff both shapes and values of Din, Dout
2. Use the diff to:
• Select the appropriate template
• Bind the template variables using the
relevant values in the two dataframes
• Generate an instantiated provlet
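The diff-then-select steps above can be sketched as below. A frame is modelled as a `{row_id: {column: value}}` dictionary instead of a Pandas dataframe. The order of the tests in `select_template`, the "vertical augmentation" label for added columns and the fallback to "1-1 derivations" are assumptions, since the exact decision tree lives in the slide's figure.

```python
def diff_frames(d_in, d_out):
    """Compare the shapes and values of two {row_id: {column: value}} frames."""
    cols_in = {c for row in d_in.values() for c in row}
    cols_out = {c for row in d_out.values() for c in row}
    # Values are compared only on cells present in both frames.
    cells = [(r, c) for r in set(d_in) & set(d_out) for c in cols_in & cols_out]
    nulls_in = sum(d_in[r].get(c) is None for r, c in cells)
    nulls_out = sum(d_out[r].get(c) is None for r, c in cells)
    return {
        "rows_added": bool(set(d_out) - set(d_in)),
        "rows_removed": bool(set(d_in) - set(d_out)),
        "cols_added": bool(cols_out - cols_in),
        "cols_removed": bool(cols_in - cols_out),
        "nulls_reduced": nulls_out < nulls_in,
        "values_changed": any(
            d_in[r].get(c) != d_out[r].get(c) for r, c in cells
        ),
    }


def select_template(diff):
    """Map a diff to a template name (assumed reading of the decision tree)."""
    if diff["rows_added"]:
        return "horizontal augmentation"
    if diff["rows_removed"]:
        return "reduction by selection"
    if diff["cols_added"] and diff["cols_removed"]:
        return "data transformation (composite)"
    if diff["cols_removed"]:
        return "reduction by projection"
    if diff["cols_added"]:
        return "vertical augmentation"  # assumption: label not in the figure text
    if diff["nulls_reduced"]:
        return "data transformation (imputation)"
    if diff["values_changed"]:
        return "data transformation"
    return "1-1 derivations"  # assumption: used when nothing changed


d_in = {0: {"E": None, "F": "a"}, 1: {"E": "E1", "F": "b"}}
d_out = {0: {"E": "Ex", "F": "a"}, 1: {"E": "E1", "F": "b"}}
template = select_template(diff_frames(d_in, d_out))
```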
Running Example
[Figure: pipeline. Da, Db → left join (K1, K2) → D1 → impute all missing → D2; D2, Dc → left join (K1, K2) → D3 → impute E, F → D4 → add 'E4', 'Ex', 'E1' → D5 → remove 'E' → D6]
$D_1 = D^a \bowtie^{left}_{K1,K2} D^b$
$D_2 = \tau_{f_1(*)}(D_1)$
$D_3 = D_2 \bowtie^{left}_{K1,K2} D^c$
$D_4 = \tau_{f_2(E,F)}(D_3)$
$D_5 = \alpha^{\rightarrow}_{h(E):\{E4,Ex,E1\}}(D_4)$
$D_6 = \pi_{\{Ax,B,Ay,D,C,F,E4,Ex,E1\}}(D_5)$
Running Example
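import pandas as pd

# df_A, df_B and df_C are the input dataframes (Da, Db, Dc)
df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join -> D1
df = df.fillna('imputed')  # imputation of all missing values -> D2
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join -> D3
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F -> D4
# one-hot encoding of column E
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)  # add the one-hot columns -> D5
df = df.drop([c], axis=1)  # remove the original column E -> D6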
Running Example
Dataframes | Diff | Template
D1 ← {Da, Db} | - | Explicit join provenance pattern
D2 ← D1 | Value change, reduced nulls → imputation | Data transformation
D3 ← {D2, Dc} | - | Explicit join provenance pattern
D4 ← D3 | Value change, reduced nulls → imputation | Data transformation
D5 ← D4 | Shape change, column(s) added | <wait!>
D6 ← D5 | Shape change, column(s) removed | Data transformation, composite
Program level transparency with control
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to application
- some control over Tracker surfaced
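One way to picture the observer idea is a small wrapper that snapshots the data before and after each step and can be switched off. This is a hypothetical sketch only: the `Tracker` class, its `apply` method and its `enabled` flag are invented here and do not reproduce the DPDS tracker, which observes Pandas dataframes with less intrusion than an explicit `apply` call.

```python
import copy


class Tracker:
    """Hypothetical observer: records the data before and after each step."""

    def __init__(self):
        self.enabled = True  # the control surfaced to the programmer
        self.log = []

    def apply(self, name, operator, data):
        """Run operator(data) and, if enabled, log a before/after snapshot."""
        before = copy.deepcopy(data)
        after = operator(data)
        if self.enabled:
            self.log.append(
                {"op": name, "before": before, "after": copy.deepcopy(after)}
            )
        return after


def fill_zip(rows):
    # Illustrative operator: replace missing Zip values with a constant.
    return [{**r, "Zip": "NE1" if r["Zip"] is None else r["Zip"]} for r in rows]


tracker = Tracker()
data = [{"Zip": None}, {"Zip": "NE4"}]
data = tracker.apply("fillna", fill_zip, data)

tracker.enabled = False  # untracked region
data = tracker.apply("fillna", fill_zip, data)
```

The logged before/after pairs are exactly what a shape and value diff would consume.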
Provenance traversals – example
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: provenance traversal over the nodes df_A (df_-1), df_B (df_0) and df_1, connected by Join and fillna]
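An element-level traversal can be sketched as a breadth-first walk over derivation edges. The in-memory edge dictionary, the `(dataframe, row, column)` element identifiers and the frame names `imputed` and `joined` are assumptions made for illustration; they do not reflect how the provenance graph is actually stored or queried.

```python
from collections import deque

# Hypothetical wasDerivedFrom edges: output element -> the elements it derives from.
# An element is identified by (dataframe name, row index, column name).
derived_from = {
    ("imputed", 0, "E"): [("joined", 0, "E")],
    ("joined", 0, "E"): [("df_B", 0, "E")],
    ("joined", 0, "A"): [("df_A", 0, "A")],
}


def lineage(element, edges):
    """Return every element reachable backwards from the given element."""
    seen = set()
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for source in edges.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


ancestors = lineage(("imputed", 0, "E"), derived_from)
```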
Benchmarking: data x pipelines
Datasets:
Pipelines:
Provenance graphs are stored in a single Neo4j database
Results
The PT/PO ratio provides a rough indication of scalability:
- The graphs for the complete pipelines are close in size to the sum of the sizes of the components' graphs
1,2,3: pipeline number
Conclusions
✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
✓ A potentially useful building block towards explanations in a Data-Centric AI setting
Limitations:
• No granularity control → limited scalability
• Operates only on Pandas dataframes
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 

More from Paolo Missier (20)

(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 

Recently uploaded

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
Agentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfAgentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfChristopherTHyatt
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀DianaGray10
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Intelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfIntelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfAnthony Lucente
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 

Recently uploaded (20)

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
Agentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfAgentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Intelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfIntelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdf
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 

Design and Development of a Provenance Capture Platform for Data Science

  • 1. Design and Development of a Provenance Capture Platform for Data Science
      Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
      ∗Department of Engineering, Roma Tre University, Italy
      †School of Computer Science, University of Birmingham, UK
      ‡Newcastle University, School of Computing
      DATAPLAT@ICDE, May 2024, Utrecht, NL
  • 2. Setting and questions
      [Figure: pipeline from source datasets through data processing to training datasets, training of a model M, and inference/generation of model outputs]
      Data explanation questions:
      • Which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling?
      • Which of the individual data items were affected by each of the transformations?
      • What was the effect?
  • 3. Provenance basics
      Abstract data transformation operator: 𝐷 → (OP) → 𝐷ʹ
      Provenance expression: [Figure: provenance graph over datasets D, D′ and activity A, with relations used, wasGeneratedBy and wasDerivedFrom]
  • 4. Extension to DAG topologies
      Example: inputs $D^a_0$, $D^b_0$, $D^c_0$ are processed independently and eventually merged into $D_n$.
      [Figure: DAG over datasets $D^a_0$, $D^a_1$, $D^b_0$, $D^b_1$, $D^c_0$, $D^{bc}_0$, $D^{abc}_3$ and operators OP1 to OP4, shown first as a dataflow and then annotated with used and wasGeneratedBy (wgby) edges]
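The DAG extension can be made concrete with a small sketch that records one `wasGeneratedBy` edge per operator output and one `used` edge per operator input, then follows derivations back to the sources. The wiring of OP1 to OP4 below is only an assumed reading of the slide's diagram, and the names `STEPS`, `provenance_edges` and `ancestors` are hypothetical, not part of the authors' platform.

```python
from collections import defaultdict

# (operator, inputs, output): an ASSUMED reading of the slide's diagram.
STEPS = [
    ("OP1", ["Da0"], "Da1"),
    ("OP2", ["Db0"], "Db1"),
    ("OP3", ["Db1", "Dc0"], "Dbc"),
    ("OP4", ["Da1", "Dbc"], "Dabc"),
]


def provenance_edges(steps):
    """Emit (subject, relation, object) triples for every step."""
    edges = []
    for op, inputs, output in steps:
        edges.append((output, "wasGeneratedBy", op))
        for dataset in inputs:
            edges.append((op, "used", dataset))
            # wasDerivedFrom composes wasGeneratedBy with used.
            edges.append((output, "wasDerivedFrom", dataset))
    return edges


def ancestors(edges, dataset):
    """All datasets that `dataset` transitively derives from."""
    parents = defaultdict(set)
    for subject, relation, obj in edges:
        if relation == "wasDerivedFrom":
            parents[subject].add(obj)
    seen, stack = set(), [dataset]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


EDGES = provenance_edges(STEPS)
```

Calling `ancestors(EDGES, "Dabc")` walks the derivation edges back to all three source datasets, which is the kind of traversal the explanation questions on slide 2 rely on.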
  • 5. The Big Provenance Dogma
      Data provenance is an enabler for:
      • Transparency
      • Explainability
      • Reproducibility
      …for a variety of underlying process and source / target data combinations.
      [Figure: the pipeline of slide 2 (data processing, training datasets, training, M, inference/generation, model outputs), annotated with Source, Process and Target]
  • 6. Contributions
      ✓ Analysis of over 500 Data Science pipelines
          • "in the wild" → Kaggle
          • "controlled" → ML Bazaar
      ✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
      ✓ Data Provenance for Data Science (DPDS)
          • automatically tracks granular provenance from Pandas
          • maximally transparent and minimally intrusive to the programmer
      ✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
  • 7. Data processing pipelines analysis: ML Bazaar
      ✓ Facilitates developing ML and AutoML systems
      ✓ Workflow style: pipelines composed out of pre-defined primitives
      ✓ Data + task pairs with benchmark results over multiple data types
      ✗ Only 5 types of operators
      ✗ Single location, controlled ecosystem
      DFS = Deep Feature Synthesis
  • 8. Data processing pipelines analysis: Kaggle
      Scope: top 200 most upvoted Python notebooks related to machine learning on Kaggle
      • 29 unique pre-processing operations
      • 12 appear in fewer than 10 pipelines
          • transposing
          • changing index values
      • feature augmentation (58)
      • scaling operations (38)
  • 10. Data reduction
      Projection, selection: $D' = \pi_C(D)$, $D' = \sigma_C(D)$
      Example: $\pi_{\{Cid, Gender, Age\}}(\sigma_{Age<30}(D))$
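As a rough illustration of the reduction example in pandas, the sketch below applies the selection Age < 30 followed by the projection onto Cid, Gender and Age. The sample dataframe `D` and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the slide.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [25, 41, 29, 33],
    "Zip": ["NE1", "NE2", None, "NE1"],
})

# Selection sigma_{Age<30}(D): keep only the rows satisfying the condition.
selected = D[D["Age"] < 30]

# Projection pi_{Cid,Gender,Age}(...): keep only the listed columns.
reduced = selected[["Cid", "Gender", "Age"]]
```

Because pandas keeps the original row index through both steps, each surviving row of `reduced` can still be matched to the row of `D` it came from, which is what a row-level provenance record needs.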
  • 11. Data augmentation
      Vertical augmentation: $\overset{\rightarrow}{\alpha}_{f(X):Y}$
      Example: $\overset{\rightarrow}{\alpha}_{f_1(Age):ageRange}(D)$
      Horizontal augmentation: $\overset{\downarrow}{\alpha}_{X:f(Y)}(D)$
      Example (group by gender, avg(age)): $E_2 = \overset{\downarrow}{\alpha}_{Gender:f_2(Age)}(D)$
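A minimal pandas sketch of the two augmentation examples follows. The binning function `f1`, the sample data, and the choice to append the aggregate rows to `D` (rather than return them on their own) are assumptions; the slide only names the operators.

```python
import pandas as pd

# Hypothetical sample data.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [25, 41, 29, 33],
})


def f1(age):
    # Hypothetical binning; the slide does not define f1.
    return "young" if age < 30 else "adult"


# Vertical augmentation: a new column ageRange derived from Age.
E1 = D.assign(ageRange=D["Age"].map(f1))

# Horizontal augmentation: new rows holding avg(Age) per Gender.
agg = D.groupby("Gender", as_index=False)["Age"].mean()
E2 = pd.concat([D, agg], ignore_index=True)
```

In `E1` every new value depends on exactly one input row, whereas each appended row of `E2` depends on all input rows of its group, so the two operators need differently shaped provenance.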
  • 12. Data transformation: $\tau_{f(X)}$
      The transformation of a set of features $X$ of $D$ using a function $f$ is obtained by substituting each value $d_{ia}$ with $f(d_{*a})$, for each feature $a$ occurring in $X$.
      Example: data imputation. Here $f$ replaces nulls with the most frequent value, for column Zip: $\tau_{f(Zip)}(D)$
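The imputation example can be sketched in pandas as below. The helper name `most_frequent_fill` and the sample data are hypothetical; the sketch also records which rows were actually changed, since those are the only cells for which the transformation had an effect.

```python
import pandas as pd

# Hypothetical sample data with one missing Zip value.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Zip": ["NE1", "NE2", None, "NE1"],
})


def most_frequent_fill(column):
    """Replace nulls with the column's most frequent non-null value."""
    mode = column.mode(dropna=True)
    if mode.empty:
        return column  # nothing to impute from
    return column.fillna(mode.iloc[0])


# tau_{f(Zip)}(D): only the Zip column is transformed.
D_prime = D.assign(Zip=most_frequent_fill(D["Zip"]))

# Rows whose Zip value was null before and is filled after.
changed_rows = D.index[D["Zip"].isna() & D_prime["Zip"].notna()].tolist()
```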
  • 13. Data fusion: join and append
      Join: $D^L \bowtie^{t}_{C} D^R$
      Append: $D^L \uplus D^R$
      Example: $D^L \bowtie^{inner}_{D^L.CId = D^R.CId} D^R$
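In pandas the join and append operators correspond roughly to `merge` and `concat`. The sketch below uses invented sample frames and, as an assumption about what row-level provenance would need, also keeps the positions of the contributing left and right rows (`left_row` and `right_row` are hypothetical column names).

```python
import pandas as pd

# Hypothetical sample data.
DL = pd.DataFrame({"CId": [1, 2, 3], "Gender": ["F", "M", "F"]})
DR = pd.DataFrame({"CId": [2, 3, 4], "Age": [41, 29, 33]})

# Inner join on DL.CId = DR.CId.
joined = DL.merge(DR, how="inner", on="CId")

# Append: all rows of DL followed by all rows of DR.
appended = pd.concat([DL, DR], ignore_index=True)

# Keep the source row positions so each joined row can be traced back
# to the pair of input rows it was built from.
DL_keyed = DL.reset_index().rename(columns={"index": "left_row"})
DR_keyed = DR.reset_index().rename(columns={"index": "right_row"})
lineage = DL_keyed.merge(DR_keyed, how="inner", on="CId")[
    ["CId", "left_row", "right_row"]
]
```

Each row of `lineage` pairs one output key with the two input rows it derives from, while every row of `appended` derives from exactly one input row.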
  • 14. Conceptual provenance capture model: templates
      A different provenance template $pt_\tau$ is associated with each type $\tau$ of operator.
      Example operator: $\overset{\rightarrow}{\alpha}_{f_1(Age):ageRange}(D)$
  • 15. Capturing provenance: bindings. At runtime, when an operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected. Data items from the inputs and outputs of the operator are used to bind the variables in the template.
      op: {old values: F, I, V} → {new values: F′, J, V′} + Binding rules
      For i : 1 … n:
      used ent.: [⟨F = X_m, I = i, V = D_{i,X_m}⟩ | X_m ∈ X]
      generated ent.: [⟨F′ = Y_h, J = i, v = f(D_{i,X})⟩ | Y_h ∈ Y]
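As a rough illustration of the binding rule above, the sketch below enumerates, for each row i, one "used" record per input column and one "generated" record per output column. The function name `bind_template`, the dictionary layout, and the toy `age_range` function are assumptions made for this example; they are not taken from DPDS.

```python
import pandas as pd


def bind_template(d_in, d_out, in_cols, out_cols):
    """Hypothetical binder: one dict of used/generated records per row i."""
    bindings = []
    for i in d_in.index:  # "For i : 1 ... n" (here: the dataframe index)
        used = [{"F": x, "I": i, "V": d_in.at[i, x]} for x in in_cols]
        generated = [{"F": y, "J": i, "V": d_out.at[i, y]} for y in out_cols]
        bindings.append({"used": used, "generated": generated})
    return bindings


def age_range(age):
    # Toy stand-in for f1(Age); the thresholds are invented for illustration.
    return "minor" if age < 18 else "adult"


d_in = pd.DataFrame({"Age": [15, 42]})
d_out = d_in.assign(ageRange=d_in["Age"].map(age_range))
bindings = bind_template(d_in, d_out, in_cols=["Age"], out_cols=["ageRange"])
```

Each entry of `bindings` holds the values that would be substituted for the template variables of one row; turning them into PROV statements is left out of this sketch.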
  • 16. Implementation by shape and value diff
      [Figure: decision diagrams. Shape changes: Rows added? Rows removed? Columns added? Columns removed? Templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation (composite), data transformation. Value changes for each column: Nulls reduced? Values changed? Templates: data transformation (imputation), data transformation, 1-1 derivations.]
      For each input/output pair Din, Dout of dataframes:
      1. Diff both shapes and values of Din, Dout
      2. Use the diff to:
         - Select the appropriate template
         - Bind the template variables using the relevant values in the two dataframes
         - Generate an instantiated provlet
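The diff step can be sketched in pandas as follows. This is a minimal sketch, not the DPDS implementation: the function names are hypothetical, the value diff assumes that Din and Dout share the same row index, and the mapping from diff flags to template names in `select_value_template` is one assumed reading of the decision diagram.

```python
import pandas as pd


def shape_diff(d_in, d_out):
    """Answer the four shape questions for one input/output pair."""
    return {
        "rows_added": len(d_out) > len(d_in),
        "rows_removed": len(d_out) < len(d_in),
        "cols_added": sorted(set(d_out.columns) - set(d_in.columns)),
        "cols_removed": sorted(set(d_in.columns) - set(d_out.columns)),
    }


def value_diff(d_in, d_out):
    """Per shared column: were nulls reduced, were non-null values changed?"""
    result = {}
    for col in d_in.columns.intersection(d_out.columns):
        before, after = d_in[col], d_out[col]
        kept = before.notna()  # cells that already had a value
        result[col] = {
            "nulls_reduced": bool(after.isna().sum() < before.isna().sum()),
            "values_changed": bool((before[kept] != after[kept]).any()),
        }
    return result


def select_value_template(flags):
    # Assumed mapping; the source diagram is not fully recoverable.
    if flags["nulls_reduced"]:
        return "data transformation (imputation)"
    if flags["values_changed"]:
        return "data transformation"
    return "1-1 derivations"


d_in = pd.DataFrame({"K": [1, 2, 3], "E": [1.0, None, 3.0]})
d_out = d_in.fillna({"E": 0.0})
shapes = shape_diff(d_in, d_out)
templates = {c: select_value_template(f) for c, f in value_diff(d_in, d_out).items()}
```

Keeping the diff (`shape_diff`, `value_diff`) separate from template selection means the selection rules can be swapped without touching the comparison logic.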
  • 17. Running Example
      [Figure: pipeline Da, Db → left join (K1,K2) → D1 → impute all missing → D2 → left join with Dc (K1,K2) → D3 → impute E,F → D4 → add ‘E4’, ‘Ex’, ‘E1’ → D5 → remove ‘E’ → D6]
      D1 = Da ⋈^{left}_{K1,K2} Db
      D2 = τ_{f1(∗)}(D1)
      D3 = D2 ⋈^{left}_{K1,K2} Dc
      D4 = τ_{f2(E,F)}(D3)
      D5 = α^{→}_{h(E):{E4,Ex,E1}}(D4)
      D6 = π_{{Ax,B,Ay,D,C,F,E4,Ex,E1}}(D5)
  • 18. Running Example: Pandas code
      import pandas as pd
      # df_A, df_B, df_C hold the input dataframes Da, Db, Dc
      df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # D1: left join
      df = df.fillna('imputed')  # D2: impute all missing values
      df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # D3: left join
      df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # D4: impute E and F
      # D5: one-hot encoding of column E
      c = 'E'
      dummies = []
      dummies.append(pd.get_dummies(df[c]))
      df_dummies = pd.concat(dummies, axis=1)
      df = pd.concat((df, df_dummies), axis=1)
      df = df.drop([c], axis=1)  # D6: remove the encoded column E
  • 19. Running Example: diffs and templates
      Dataframes: Diff; Template
      D1 ← {Da, Db}: explicit join provenance pattern
      D2 ← D1: value change, reduced nulls → imputation; data transformation
      D3 ← {D2, Dc}: explicit join provenance pattern
      D4 ← D3: value change, reduced nulls → imputation; data transformation
      D5 ← D4: shape change, column(s) added; <wait!>
      D6 ← D5: shape change, column(s) removed; data transformation, composite
  • 20. Program-level transparency with control
      Approach:
      - add an observer to monitor dataframe changes
      - mostly transparent to the application
      - some control over the Tracker is surfaced to the programmer
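One way to picture the observer approach is a thin proxy around a dataframe that reports every dataframe-returning call to a tracker. The sketch below is purely illustrative: the `Tracker` and `Observed` classes, the `active` switch, and the event format are assumptions and do not reproduce the DPDS API.

```python
import pandas as pd


class Tracker:
    """Hypothetical observer: records one event per dataframe change."""

    def __init__(self):
        self.events = []
        self.active = True  # the "control" surfaced to the programmer

    def notify(self, op_name, d_in, d_out):
        if self.active:
            self.events.append((op_name, d_in.shape, d_out.shape))


class Observed:
    """Thin proxy that forwards attribute access to the wrapped dataframe."""

    def __init__(self, df, tracker):
        self._df = df
        self._tracker = tracker

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if not callable(attr):
            return attr  # plain attributes such as .shape pass through

        def wrapper(*args, **kwargs):
            result = attr(*args, **kwargs)
            if isinstance(result, pd.DataFrame):
                # Report the input/output pair, then keep observing the result.
                self._tracker.notify(name, self._df, result)
                return Observed(result, self._tracker)
            return result

        return wrapper


tracker = Tracker()
df = Observed(pd.DataFrame({"E": [1.0, None], "F": [None, 2.0]}), tracker)
df = df.fillna(0.0)          # recorded by the tracker
tracker.active = False       # programmer switches capture off
df = df.drop(columns=["F"])  # executed, but not recorded
```

The pipeline code itself is unchanged apart from the initial wrapping, which is the sense in which such a design stays mostly transparent to the application.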
  • 21. Provenance traversals: example
      Capture, store and query element-level provenance:
      - derivation of each element of each intermediate dataframe (when possible)
      - efficiently, at scale
      [Figure: provenance graph over fillna, Join, df_1, df_B (df_0), df_A (df_-1)]
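A backward traversal over element-level derivations can be sketched with a plain dictionary standing in for the graph store. The element identifiers, the edge set, and the `lineage` function below are invented for illustration and do not come from the benchmark data.

```python
from collections import deque

# (dataframe, row, column) -> elements it was derived from (wasDerivedFrom).
# Hypothetical edges: a join followed by an imputation.
derived_from = {
    ("df_2", 0, "E"): [("df_1", 0, "E")],  # fillna
    ("df_1", 0, "E"): [("df_B", 0, "E")],  # join, value taken from df_B
    ("df_1", 0, "A"): [("df_A", 0, "A")],  # join, value taken from df_A
}


def lineage(element, edges):
    """Breadth-first backward traversal; returns every ancestor of element."""
    seen = []
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


ancestors = lineage(("df_2", 0, "E"), derived_from)
```

In a graph database the same question would typically be expressed as a variable-length path query over derivation edges; the dictionary version only shows the traversal logic.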
  • 22. Benchmarking: data × pipelines. [Figure: benchmark datasets and pipelines] Provenance graphs are stored in a single Neo4j database.
  • 23. Results. The PT/PO ratio provides a rough indication of scalability:
      - the graphs for the complete pipelines are close in size to the sum of the sizes of the components’ graphs
      [Figure legend: 1, 2, 3 = pipeline number]
  • 24. Conclusions
      ✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
      ✓ A potentially useful building block towards explanations in a Data-Centric AI setting
      Limitations:
      ❖ No granularity control → limited scalability
      ❖ Operates only on Pandas dataframes