Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
∗Department of Engineering, Roma Tre University, Italy
†School of Computer Science, University of Birmingham, UK
‡Newcastle University, School of Computing
DATAPLAT@ICDE
May 2024
Utrecht, NL
Design and Development of a Provenance Capture Platform for Data Science
Setting and questions
[Figure: pipeline from Source datasets through Data processing to Training datasets, then Training of model M, then Inference/generation producing Model outputs]
Data explanation questions:
• Which data transformations were applied to raw input dataset(s) to generate the final
training set used for modelling?
• Which of the individual data items were affected by each of the transformations?
• What was the effect?
Provenance basics
Abstract data transformation operator: 𝐷 → (OP) → 𝐷ʹ
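[Figure: provenance graph linking D and D′ through activity A, with edges used, wasGeneratedBy and wasDerivedFrom]
Provenance expression: used(A, D), wasGeneratedBy(D′, A), wasDerivedFrom(D′, D)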
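As a rough illustration, the three relations named on this slide can be written down as plain tuples. This is only a sketch and not part of DPDS: the `Relation` type and the `provenance_of` helper are hypothetical names introduced here, and the argument order of each relation follows the W3C PROV convention.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One PROV-style statement: kind(subj, obj). Hypothetical helper type."""
    kind: str
    subj: str
    obj: str


def provenance_of(activity: str, d_in: str, d_out: str) -> list:
    """Provenance expression for one operator application d_in -> (OP) -> d_out."""
    return [
        Relation("used", activity, d_in),            # the activity read d_in
        Relation("wasGeneratedBy", d_out, activity), # the activity produced d_out
        Relation("wasDerivedFrom", d_out, d_in),     # d_out depends on d_in
    ]


expression = provenance_of("A", "D", "D'")
for r in expression:
    print(f"{r.kind}({r.subj}, {r.obj})")
```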
Extension to DAG topologies
Example: inputs $D^a_0$, $D^b_0$, $D^c_0$ are processed independently and eventually merged into $D_n$:
[Figure: dataflow DAG over datasets $D^a_0$, $D^a_1$, $D^b_0$, $D^b_1$, $D^c_0$, $D^{bc}_0$, $D^{abc}_3$ with operators OP1 to OP4, shown together with the corresponding provenance graph with used and wasGeneratedBy (wgby) edges]
The Big Provenance Dogma
Data provenance is an enabler for:
• Transparency
• Explainability
• Reproducibility
…for a variety of underlying process and source / target data combinations
[Figure: pipeline Data processing, Training datasets, Training of model M, Inference/generation, Model outputs, annotated with the roles Source, Process and Target]
Contributions
✓ Analysis of over 500 Data Science pipelines
   - "in the wild" → Kaggle
   - "controlled" → ML Bazaar
✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
✓ Data Provenance for Data Science (DPDS)
   - automatically tracks granular provenance from Pandas
   - maximally transparent and minimally intrusive to the programmer
✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
Data processing pipelines analysis: ML Bazaar
✓ Facilitates developing ML and AutoML systems
✓ Workflow style: pipelines composed out of pre-defined primitives
✓ Data + task pairs with benchmark results over multiple data types
✗ Only 5 types of operators
✗ Single location, controlled ecosystem
DFS = Deep Feature Synthesis
Data processing pipelines analysis: Kaggle
Scope: top 200 most upvoted Python notebooks related to machine learning on Kaggle
• 29 unique pre-processing operations
• 12 appear in fewer than 10 pipelines, for example:
   - transposing
   - changing index values
• Feature augmentation (58)
• Scaling operations (38)
Data processing operators
Data reduction
Projection: $D' = \pi_C(D)$
Selection: $D' = \sigma_C(D)$
Example: $\pi_{\{Cid, Gender, Age\}}(\sigma_{Age<30}(D))$
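A minimal Pandas sketch of the example above, selection on Age < 30 followed by projection onto Cid, Gender and Age. The sample rows and the extra `Zip` column are made up for illustration.

```python
import pandas as pd

# Toy input dataset D (values are illustrative only)
D = pd.DataFrame({
    "Cid": [1, 2, 3],
    "Gender": ["F", "M", "F"],
    "Age": [25, 41, 29],
    "Zip": ["NE1", "B15", "NE2"],
})

selected = D[D["Age"] < 30]                    # selection: keep rows with Age < 30
D_prime = selected[["Cid", "Gender", "Age"]]   # projection: keep three columns

print(D_prime)
```

Selection changes only the number of rows and projection only the number of columns, which is what makes the two reductions distinguishable from the shapes of the input and output dataframes.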
Data augmentation
Vertical augmentation
$\alpha^{\rightarrow}_{f(X):Y}(D)$
Example: $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
Horizontal augmentation
$\alpha^{\downarrow}_{X:f(Y)}(D)$
Example: $E_2 = \alpha^{\downarrow}_{Gender:f_2(Age)}(D)$ (group by Gender, avg(Age))
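The two augmentations might look as follows in Pandas. This is a sketch under assumptions: the bucketing function `f1` and the sample rows are invented, and horizontal augmentation is read here as appending the per-Gender average Age as new rows, which is one plausible interpretation of the slide's example rather than a confirmed definition.

```python
import pandas as pd

D = pd.DataFrame({
    "Cid": [1, 2, 3],
    "Gender": ["F", "M", "F"],
    "Age": [25, 41, 29],
})


def f1(age: int) -> str:
    """Hypothetical bucketing function deriving ageRange from Age."""
    return "<30" if age < 30 else ">=30"


# Vertical augmentation: a new column ageRange computed from Age
D_v = D.assign(ageRange=D["Age"].map(f1))

# Horizontal augmentation (assumed reading): new rows holding avg(Age) per Gender
avg_rows = D.groupby("Gender", as_index=False)["Age"].mean()
D_h = pd.concat([D, avg_rows], ignore_index=True)

print(D_v)
print(D_h)
```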
Data transformation
$\tau_{f(X)}(D)$: the transformation of a set of features $X$ of $D$ using a function $f$ is obtained by substituting each value $d_{ia}$ with $f(d_{*a})$, for each feature $a$ occurring in $X$.
Example: data imputation. Here $f$ replaces nulls with the most frequent value, for column Zip: $\tau_{f(Zip)}(D)$
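The imputation example can be sketched in Pandas as below. The sample data is invented; the shape of the dataframe stays the same and only null values in Zip change.

```python
import pandas as pd

D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Zip": ["NE1", None, "NE1", "B15"],
})

# f: replace nulls with the most frequent value of the column
most_frequent = D["Zip"].mode().iloc[0]
D_prime = D.assign(Zip=D["Zip"].fillna(most_frequent))

print(D_prime)
```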
Data fusion: join and append
Join: $D^L \bowtie^{t}_{C} D^R$
Append: $D^L \uplus D^R$
Example (inner join): $D^L \bowtie^{inner}_{D^L.CId = D^R.CId} D^R$
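In Pandas, the two fusion operators correspond roughly to `pd.merge` and `pd.concat`. The sketch below uses invented sample data and takes `CId` as the join column, as in the slide's example.

```python
import pandas as pd

D_L = pd.DataFrame({"CId": [1, 2, 3], "Gender": ["F", "M", "F"]})
D_R = pd.DataFrame({"CId": [2, 3, 4], "Age": [41, 29, 35]})

# Join of type t = inner on condition D_L.CId = D_R.CId
joined = pd.merge(D_L, D_R, on="CId", how="inner")

# Append: rows of D_R are stacked under the rows of D_L
appended = pd.concat([D_L, D_R], ignore_index=True)

print(joined)
print(appended)
```

Changing `how` to `"left"`, `"right"` or `"outer"` gives the other join types covered by the superscript $t$.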
Conceptual provenance capture model: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator.
Example operator: $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
Template variables for op: {old values: F, I, V} → {new values: F′, J, V′}
Binding rules
For $i : 1 \ldots n$:
used entities: $[\langle F = X_m, I = i, V = D_{i,X_m} \rangle \mid X_m \in X]$
generated entities: $[\langle F' = Y_h, J = i, V' = f(D_{i,X}) \rangle \mid Y_h \in Y]$
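The binding rules can be read as a loop over rows that emits one list of used entities and one list of generated entities per row. The sketch below is an illustration only: `bind` is a hypothetical helper, not a DPDS function, and it assumes the input and output dataframes share the same row index.

```python
import pandas as pd


def bind(d_in: pd.DataFrame, d_out: pd.DataFrame, X: list, Y: list) -> list:
    """For each row i, bind template variables (F, I, V) and (F', J, V')."""
    bindings = []
    for i in d_in.index:
        # used entities: one per input feature X_m in X
        used = [{"F": x, "I": i, "V": d_in.at[i, x]} for x in X]
        # generated entities: one per output feature Y_h in Y
        generated = [{"F'": y, "J": i, "V'": d_out.at[i, y]} for y in Y]
        bindings.append((used, generated))
    return bindings


D = pd.DataFrame({"Age": [25, 41]})
D_out = D.assign(ageRange=["<30", ">=30"])  # result of a vertical augmentation

result = bind(D, D_out, X=["Age"], Y=["ageRange"])
for used, generated in result:
    print(used, "->", generated)
```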
Implementation by shape and value diff
Shape changes:
[Figure: decision tree on shape changes (Rows added? Rows removed? Columns added? Columns removed?) whose Y/N branches select one of the templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation (composite), data transformation]
Value changes for each column:
[Figure: decision tree on value changes (Nulls reduced? Values changed?) whose Y/N branches select one of the templates: data transformation (imputation), data transformation; 1-1 derivations]
For each input/output pair Din, Dout of dataframes:
1. Diff both shapes and values of Din, Dout
2. Use the diff to:
• Select the appropriate template
• Bind the template variables using the
relevant values in the two dataframes
• Generate an instantiated provlet
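The template selection step above can be approximated by comparing shapes first and values second. The following is a simplified sketch, not the DPDS implementation: `select_template` is a hypothetical function, the branch order is an assumption (the decision tree's exact layout is not recoverable from the slide), row changes are detected by row count only, and the "vertical augmentation" label for the columns-added-only case is also assumed.

```python
import pandas as pd


def select_template(d_in: pd.DataFrame, d_out: pd.DataFrame) -> str:
    """Pick a provenance template name from a shape diff, then a value diff."""
    cols_added = set(d_out.columns) - set(d_in.columns)
    cols_removed = set(d_in.columns) - set(d_out.columns)

    # Shape changes
    if len(d_out) > len(d_in):
        return "horizontal augmentation"
    if len(d_out) < len(d_in):
        return "reduction by selection"
    if cols_added and cols_removed:
        return "data transformation (composite)"
    if cols_added:
        return "vertical augmentation"  # assumed label for this branch
    if cols_removed:
        return "reduction by projection"

    # Same shape: value changes
    if d_out.isna().sum().sum() < d_in.isna().sum().sum():
        return "data transformation (imputation)"
    if not d_out.equals(d_in):
        return "data transformation"
    return "no change"


d_in = pd.DataFrame({"E": ["a", None], "F": [1, 2]})
print(select_template(d_in, d_in.fillna("x")))
print(select_template(d_in, d_in.drop(columns=["F"])))
print(select_template(d_in, d_in.iloc[:1]))
```

Once a template is chosen, binding its variables to the relevant values of the two dataframes yields the instantiated provlet.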
Running Example
[Figure: pipeline Da, Db → Left join (K1,K2) → D1 → Impute all missing → D2; D2, Dc → Left join (K1,K2) → D3 → Impute E,F → D4 → Add 'E4', 'Ex', 'E1' → D5 → Remove 'E' → D6]
$D_1 = D_a \bowtie^{left}_{K1,K2} D_b$
$D_2 = \tau_{f_1(*)}(D_1)$
$D_3 = D_2 \bowtie^{left}_{K1,K2} D_c$
$D_4 = \tau_{f_2(E,F)}(D_3)$
$D_5 = \alpha^{\rightarrow}_{h(E):\{E4,Ex,E1\}}(D_4)$
$D_6 = \pi_{\{Ax,B,Ay,D,C,F,E4,Ex,E1\}}(D_5)$
Running Example
import pandas as pd

# df_A, df_B, df_C are the input dataframes (Da, Db, Dc in the pipeline)
df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
df = df.fillna('imputed')  # imputation of all missing values
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
# one-hot encoding of column E
c = 'E'
df_dummies = pd.get_dummies(df[c])
df = pd.concat((df, df_dummies), axis=1)  # add the encoded columns
df = df.drop([c], axis=1)  # remove the original column
Running Example
| Dataframes | Diff | Template |
|---|---|---|
| D1 ← {Da, Db} | Explicit join | Provenance pattern |
| D2 ← D1 | Value change, reduced nulls → imputation | Data transformation |
| D3 ← {D2, Dc} | Explicit join | Provenance pattern |
| D4 ← D3 | Value change, reduced nulls → imputation | Data transformation |
| D5 ← D4 | Shape change, column(s) added | <wait!> |
| D6 ← D5 | Shape change, column(s) removed | Data transformation, composite |
Program-level transparency with control
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to the application
- some control over the Tracker is surfaced to the programmer
Provenance traversals – example
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: example traversal over dataframes df_A (df_-1), df_B (df_0) and df_1, connected by Join and fillna operations]
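An element-level traversal can be pictured as a backward walk over derivation edges. The sketch below is a toy in-memory version, not how DPDS queries its graph store: the `lineage` function, the `(dataframe, row, column)` element identifiers and the dataframe names `joined` and `filled` are all hypothetical.

```python
from collections import deque

# wasDerivedFrom edges: derived element -> list of source elements.
# Each element is identified by (dataframe, row, column).
derived_from = {
    ("filled", 0, "E"): [("joined", 0, "E")],    # fillna step
    ("joined", 0, "E"): [("df_B", 0, "E")],      # join step
    ("joined", 0, "K1"): [("df_A", 0, "K1"), ("df_B", 0, "K1")],
}


def lineage(element, edges):
    """All elements reachable from `element` by following derivations backwards."""
    seen = set()
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for source in edges.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


print(lineage(("filled", 0, "E"), derived_from))
```

Reversing the edge direction gives the forward question: which downstream elements were affected by a given source value.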
Benchmarking: data x pipelines
Datasets:
Pipelines:
Provenance graphs are stored in a single Neo4j database
Results
The PT/PO ratio provides a rough indication of scalability:
- The graphs for the complete pipelines are close in size to the sum of the sizes of the components' graphs
1,2,3: pipeline number
Conclusions
✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
✓ A potentially useful building block towards explanations in a Data-Centric AI setting
Limitations:
• No granularity control → limited scalability
• Operates only on Pandas dataframes
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Paolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Paolo Missier
 

More from Paolo Missier (20)

(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 

Recently uploaded

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
Claudio Di Ciccio
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 

Recently uploaded (20)

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 

Design and Development of a Provenance Capture Platform for Data Science

  • 1. Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗ ∗Department of Engineering, Roma Tre University, Italy †School of Computer Science, University of Birmingham, UK ‡Newcastle University, School of Computing DATAPLAT@ICDE May 2024 Utrecht, NL Design and Development of a Provenance Capture Platform for Data Science
  • 2. Setting and questions
    [Figure: pipeline from source datasets through data processing to training datasets, then training of model M and inference/generation of model outputs]
    Data explanation questions:
      • Which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling?
      • Which individual data items were affected by each of the transformations?
      • What was the effect?
  • 3. Provenance basics
    Abstract data transformation operator: 𝐷 → (OP) → 𝐷ʹ
    Provenance expression: [Figure: entities D and D′ and activity A, connected by the relations used, wasGeneratedBy and wasDerivedFrom]
  • 4. Extension to DAG topologies
    Example: inputs $D_0^a$, $D_0^b$, $D_0^c$ are processed independently and eventually merged into $D_n$.
    [Figure: DAG over datasets $D_0^a$, $D_1^a$, $D_0^b$, $D_1^b$, $D_0^c$, $D_0^{bc}$, $D_3^{abc}$ and operators OP1 to OP4, shown first as a dataflow and then annotated with used and wasGeneratedBy (wgby) edges]
  • 5. The Big Provenance Dogma
    Data provenance is an enabler for:
      • Transparency
      • Explainability
      • Reproducibility
    …for a variety of underlying process and source/target data combinations.
    [Figure: pipeline of training datasets, data processing, training, M, inference/generation and model outputs, annotated with Source, Process and Target]
  • 6. Contributions
    ✓ Analysis of over 500 Data Science pipelines
      • “in the wild” → Kaggle
      • “controlled” → ML Bazaar
    ✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
    ✓ Data Provenance for Data Science (DPDS)
      • automatically tracks granular provenance from Pandas
      • maximally transparent and minimally intrusive to the programmer
    ✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
  • 7. Data processing pipelines analysis: ML Bazaar
    ✓ Facilitates developing ML and AutoML systems
    ✓ Workflow style: pipelines composed of pre-defined primitives
    ✓ Data + task pairs with benchmark results over multiple data types
    ✗ Only 5 types of operators
    ✗ Single location, controlled ecosystem
    DFS = Deep Feature Synthesis
  • 8. Data processing pipelines analysis: Kaggle
    Scope: top 200 most upvoted Python notebooks related to machine learning on Kaggle
      • 29 unique pre-processing operations
      • 12 appear in fewer than 10 pipelines
        • transposing
        • changing index values
      • feature augmentation (58)
      • scaling operations (38)
  • 10. Data reduction
    Projection and selection: $D' = \pi_C(D)$, $D' = \sigma_C(D)$
    Example: $\pi_{\{Cid, Gender, Age\}}(\sigma_{Age<30}(D))$
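The two reduction operators map directly onto everyday Pandas idioms. The following is a minimal sketch, assuming a toy dataframe whose column names (`Cid`, `Gender`, `Age`, `Zip`) and values are hypothetical and chosen only to match the slide's example expression.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the slide's example.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [25, 41, 29, 33],
    "Zip": ["00100", "00146", None, "00154"],
})

# Selection sigma_{Age<30}(D): keep only the rows that satisfy the predicate.
selected = D[D["Age"] < 30]

# Projection pi_{Cid,Gender,Age}(...): keep only the listed columns.
reduced = selected[["Cid", "Gender", "Age"]]

print(reduced)
```

Note that the surviving rows keep their original index labels, which is what lets a provenance tracker relate each output row back to its input row.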
  • 11. Data augmentation
    Vertical augmentation: $\alpha^{\rightarrow}_{f(X):Y}$
      Example: $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
    Horizontal augmentation: $\alpha^{\downarrow}_{X:f(Y)}(D)$
      Example (group by gender, avg(age)): $E_2 = \alpha^{\downarrow}_{Gender:f_2(Age)}(D)$
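As an illustration of the two augmentation directions, here is a hedged Pandas sketch. It reads vertical augmentation as adding a derived column and horizontal augmentation as appending aggregate rows (average age per gender); that reading, the helper `age_range` standing in for f1, and the toy data are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [25, 41, 29, 33],
})

def age_range(age):
    """Hypothetical f1: bucket an age into a coarse range label."""
    return "<30" if age < 30 else ">=30"

# Vertical augmentation: a new column ageRange derived from Age.
vertical = D.assign(ageRange=D["Age"].map(age_range))

# Horizontal augmentation: new rows, one per gender, holding avg(Age).
avg_rows = D.groupby("Gender", as_index=False)["Age"].mean()
horizontal = pd.concat([D, avg_rows], ignore_index=True)

print(vertical)
print(horizontal)
```

The appended rows have no `Cid`, so Pandas fills that column with NaN for them.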
  • 12. Data transformation
    $\tau_{f(X)}$: the transformation of a set of features $X$ of $D$ using a function $f$ is obtained by substituting each value $d_{ia}$ with $f(d_{*a})$, for each feature $a$ occurring in $X$.
    Example: data imputation. Here $f$ replaces nulls with the most frequent value, for column Zip: $\tau_{f(Zip)}(D)$
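The imputation example can be sketched in Pandas as follows. The dataframe contents are hypothetical; the only part taken from the slide is the idea of replacing nulls in column `Zip` with its most frequent value.

```python
import pandas as pd

# Hypothetical toy data with one missing Zip value.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Zip": ["00146", None, "00146", "00154"],
})

# f: the most frequent non-null value of the column (mode ignores nulls).
most_frequent = D["Zip"].mode().iloc[0]

# tau_{f(Zip)}(D): same shape as D, nulls in Zip replaced by f's value.
D_prime = D.assign(Zip=D["Zip"].fillna(most_frequent))

print(D_prime)
```

The output has the same rows and columns as the input; only values change, which is what distinguishes a transformation from a reduction or an augmentation.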
  • 13. Data fusion: join and append
    Join: $D^L \bowtie^{t}_{C} D^R$
    Append: $D^L \uplus D^R$
    Example: $D^L \bowtie^{inner}_{D^L.CId = D^R.CId} D^R$
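A short Pandas sketch of the two fusion operators, assuming two hypothetical dataframes `D_L` and `D_R` that share the key column `CId`. Treating append as a row-wise `pd.concat` whose result carries the union of the two schemas is an assumption of this example.

```python
import pandas as pd

# Hypothetical left and right inputs sharing the key column CId.
D_L = pd.DataFrame({"CId": [1, 2, 3], "Gender": ["F", "M", "F"]})
D_R = pd.DataFrame({"CId": [2, 3, 4], "Zip": ["00146", "00154", "00100"]})

# Inner join on D_L.CId = D_R.CId: only keys present on both sides survive.
joined = pd.merge(D_L, D_R, on="CId", how="inner")

# Append: stack the rows of both inputs; missing columns are filled with NaN.
appended = pd.concat([D_L, D_R], ignore_index=True)

print(joined)
print(appended)
```

Changing `how` to `"left"`, `"right"` or `"outer"` gives the other join types.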
  • 14. Conceptual provenance capture model: templates
    Example operator: $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
    A different provenance template pt𝜏 is associated with each type 𝜏 of operator.
  • 15. Capturing provenance: bindings
    At runtime, when an operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected. Data items from the inputs and outputs of the operator are used to bind the variables in the template.
    op: {old values: F, I, V} → {new values: F′, J, V′} + binding rules
    For $i = 1, \ldots, n$:
      used ent.: $[\langle F = X_m, I = i, V = D_{i,X_m} \rangle \mid X_m \in X]$
      generated ent.: $[\langle F' = Y_h, J = i, V' = f(D_{i,X}) \rangle \mid Y_h \in Y]$
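The binding rule above can be sketched as a small function that, for each row i, emits one used entity per input feature and one generated entity per output feature. This is an illustrative sketch only: the function name `bind_vertical_augmentation`, the dictionary representation of entities, and the toy data are hypothetical and not taken from the DPDS implementation. The generated value is read from the output dataframe, where it equals f applied to the row's input features.

```python
import pandas as pd

def bind_vertical_augmentation(d_in, d_out, X, Y):
    """Return one binding per row i: its used and generated entities."""
    bindings = []
    for i in d_in.index:
        # Used entities: <F = X_m, I = i, V = D[i, X_m]> for each X_m in X.
        used = [{"F": x, "I": i, "V": d_in.at[i, x]} for x in X]
        # Generated entities: <F' = Y_h, J = i, V' = output value> for each Y_h in Y.
        generated = [{"F": y, "J": i, "V": d_out.at[i, y]} for y in Y]
        bindings.append({"used": used, "generated": generated})
    return bindings

# Hypothetical input and the output of a vertical augmentation on Age.
D = pd.DataFrame({"Cid": [1, 2], "Age": [25, 41]})
D_out = D.assign(ageRange=["<30", ">=30"])

bindings = bind_vertical_augmentation(D, D_out, X=["Age"], Y=["ageRange"])
for b in bindings:
    print(b)
```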
  • 16. 16 Implementation by shape and value diff
      For each input/output pair Din, Dout of dataframes:
      1. Diff both the shapes and the values of Din and Dout
      2. Use the diff to:
         • Select the appropriate template
         • Bind the template variables using the relevant values in the two dataframes
         • Generate an instantiated provlet
      Shape changes checked: Rows added? Rows removed? Columns added? Columns removed?
      Templates (shape changes): horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite): Y Y Y Y; data transformation: Y N N N
      Value changes checked for each column: Nulls reduced? Values changed?
      Templates (value changes): data transformation (imputation); data transformation; 1-1 derivations
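The diff-then-select step can be sketched as follows. This is a minimal illustration, not the DPDS code: the function names `shape_diff` and `select_template` are hypothetical, and the mapping from diff flags to templates is an assumption, since the decision table on the slide did not survive extraction intact.

```python
import pandas as pd


def shape_diff(d_in, d_out):
    """Compare row indexes and column labels of two dataframes."""
    cols_in, cols_out = set(d_in.columns), set(d_out.columns)
    idx_in, idx_out = set(d_in.index), set(d_out.index)
    return {
        "rows_added": bool(idx_out - idx_in),
        "rows_removed": bool(idx_in - idx_out),
        "cols_added": bool(cols_out - cols_in),
        "cols_removed": bool(cols_in - cols_out),
    }


def select_template(d_in, d_out):
    """Pick a template name from the diff (the mapping is an assumption)."""
    changed = [k for k, v in shape_diff(d_in, d_out).items() if v]
    if len(changed) > 1:
        return "data transformation (composite)"
    if changed == ["cols_added"]:
        return "horizontal augmentation"
    if changed == ["rows_removed"]:
        return "reduction by selection"
    if changed == ["cols_removed"]:
        return "reduction by projection"
    if changed == ["rows_added"]:
        return "data transformation"
    # Same shape: fall back to a value diff.
    if d_out.isna().sum().sum() < d_in.isna().sum().sum():
        return "data transformation (imputation)"
    if not d_in.equals(d_out):
        return "data transformation"
    return "no change"
```

For example, `select_template(df, df.fillna(0))` selects the imputation template whenever `df` contains at least one null, because the shape is unchanged and the null count drops.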
  • 17. 17 Running Example
      [Pipeline diagram: Da and Db → left join (K1,K2) → D1 → impute all missing → D2; D2 and Dc → left join (K1,K2) → D3 → impute E,F → D4 → add 'E4', 'Ex', 'E1' → D5 → remove 'E' → D6]
      $D_1 = D_a \bowtie^{\text{left}}_{K_1,K_2} D_b$
      $D_2 = \tau_{f_1(*)}(D_1)$
      $D_3 = D_2 \bowtie^{\text{left}}_{K_1,K_2} D_c$
      $D_4 = \tau_{f_2(E,F)}(D_3)$
      $D_5 = \alpha^{\to}_{h(E):\{E4,Ex,E1\}}(D_4)$
      $D_6 = \pi_{\{Ax,B,Ay,D,C,F,E4,Ex,E1\}}(D_5)$
  • 18. 18 Running Example [pipeline diagram as on slide 17]
      import pandas as pd

      # df_A, df_B, df_C are the input dataframes Da, Db, Dc
      df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
      df = df.fillna('imputed')  # imputation of all missing values
      df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
      df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
      # one-hot encoding of column E
      c = 'E'
      dummies = []
      dummies.append(pd.get_dummies(df[c]))
      df_dummies = pd.concat(dummies, axis=1)
      df = pd.concat((df, df_dummies), axis=1)  # add the encoded columns
      df = df.drop([c], axis=1)  # remove the original column E
  • 19. 19 Running Example [pipeline diagram as on slide 17]
      Dataframes: diff and template
      D1 ← {Da, Db}: explicit join provenance pattern
      D2 ← D1: value change, reduced nulls → imputation; template: data transformation
      D3 ← {D2, Dc}: explicit join provenance pattern
      D4 ← D3: value change, reduced nulls → imputation; template: data transformation
      D5 ← D4: shape change, column(s) added; <wait!>
      D6 ← D5: shape change, column(s) removed; template: data transformation, composite
  • 20. 20 Program-level transparency with control. Approach: - add an observer to monitor dataframe changes - mostly transparent to the application - some control over the Tracker is surfaced
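The observer idea can be illustrated with a small sketch. The slide names a Tracker but does not show its interface, so the class below is hypothetical and is not the DPDS API: it watches reassignments of a dataframe through a property setter, records the shapes before and after each change, and exposes an `enabled` flag as one possible form of programmer control.

```python
import pandas as pd


class Tracker:
    """Hypothetical observer over a dataframe; not the DPDS Tracker API."""

    def __init__(self, df):
        self._df = df
        self.log = []        # one record per observed change
        self.enabled = True  # assumed control knob: pause or resume capture

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        # Called on every reassignment; this is where a diff of the two
        # dataframes could be computed and a provenance fragment generated.
        if self.enabled:
            self.log.append({"shape_in": self._df.shape, "shape_out": new_df.shape})
        self._df = new_df


tracker = Tracker(pd.DataFrame({"key1": [1, 2], "E": ["a", None]}))
tracker.df = tracker.df.fillna("imputed")    # observed
tracker.enabled = False
tracker.df = tracker.df.drop(columns=["E"])  # not observed
```

The application code stays close to ordinary Pandas: only the assignment target changes, which is one way to read "mostly transparent to the application".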
  • 21. 21 Provenance traversals: example. Capture, store and query element-level provenance: - Derivation of each element of each intermediate dataframe (when possible) - Efficiently, at scale [Diagram labels: fillna, Join, df_1, df_B (df_0), df_A (df_-1)]
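An element-level traversal can be sketched in a few lines. The example below is an assumption-laden illustration: it keeps the derivation edges in an in-memory dictionary rather than a graph database, identifies each element by a hypothetical (dataframe, row, column) triple, and uses made-up dataframe names for a join followed by a fillna.

```python
# Hypothetical wasDerivedFrom edges: element -> elements it was derived from.
derived_from = {
    ("df_filled", 0, "E"): [("df_joined", 0, "E")],   # produced by fillna
    ("df_joined", 0, "E"): [("df_B", 0, "E")],        # produced by the join
    ("df_joined", 0, "key1"): [("df_A", 0, "key1")],  # produced by the join
}


def lineage(element, edges):
    """Return every element reachable backwards from `element`, nearest first."""
    seen, stack, order = set(), [element], []
    while stack:
        node = stack.pop()
        for src in edges.get(node, []):
            if src not in seen:
                seen.add(src)
                order.append(src)
                stack.append(src)
    return order


sources = lineage(("df_filled", 0, "E"), derived_from)
```

Here `sources` lists the intermediate element in the joined dataframe followed by the original element in the input dataframe.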
  • 22. 22 Benchmarking: data x pipelines Datasets: Pipelines: Provenance graphs are stored in a single Neo4J database
  • 23. 23 Results. The PT/PO ratio provides a rough indication of scalability: - The graphs for the complete pipelines are close in size to the sum of the sizes of the components' graphs [Plot legend: 1, 2, 3 = pipeline number]
  • 24. 24 Conclusions ✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing ✓ A potentially useful building block towards explanations in a Data Centric AI setting Limitations: - No granularity control --> limited scalability - Operates only on Pandas dataframes