Design and Development of a Provenance Capture Platform for Data Science

pmissier 276 views 24 slides May 14, 2024

About This Presentation

A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).

Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.


Slide Content

Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
∗Department of Engineering, Roma Tre University, Italy
†School of Computer Science, University of Birmingham, UK
‡Newcastle University, School of Computing
DATAPLAT@ICDE
May 2024
Utrecht, NL
Design and Development of a Provenance
Capture Platform for Data Science

2
Setting and questions
[Figure: Source datasets → Data processing → Training datasets → Training → M → Inference/generation → Model outputs]
Data explanation questions:
• Which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling?
• Which individual data items were affected by each transformation?
• What was the effect?
DATAPLAT@ICDE 2024

3
Provenance basics
Abstract data transformation operator: D → (OP) → D′
[Figure: provenance graph over nodes D, D′ and A, with edges used, wasGeneratedBy and wasDerivedFrom]
Provenance expression:
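The three relations named on this slide can be written out for a single operator application. This is a minimal sketch, assuming the usual W3C PROV reading (the activity used the input, the output wasGeneratedBy the activity, the output wasDerivedFrom the input). The function name `provenance_of` and the tuple encoding are illustrative assumptions, not part of DPDS.

```python
def provenance_of(activity, d_in, d_out):
    """Return PROV-style statements for one application: d_in -> activity -> d_out.

    Each statement is a (relation, subject, object) tuple. This encoding is
    an assumption made for illustration only.
    """
    return [
        ("used", activity, d_in),             # the activity read the input
        ("wasGeneratedBy", d_out, activity),  # the output was produced by the activity
        ("wasDerivedFrom", d_out, d_in),      # the output depends on the input
    ]


statements = provenance_of("A", "D", "D'")
for relation, subject, obj in statements:
    print(f"{relation}({subject}, {obj})")
```

The same three statements are repeated per operator when the pipeline is a chain, which is what makes the extension to DAG topologies mostly a matter of wiring.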

4
Extension to DAG topologies
Example: inputs Da0, Db0, Dc0 are processed independently and eventually merged into Dabc3:
[Figure: pipeline over Da0, Db0, Dc0 with operators OP1 (Da0 → Da1), OP2 (Db0 → Db1), OP3 (output Dbc0) and OP4 (output Dabc3), shown twice: as a dataflow and as a provenance graph with used and wasGeneratedBy (wgby) edges]

5
The Big Provenance Dogma
Data provenance is an enabler for:
• Transparency
• Explainability
• Reproducibility
…for a variety of underlying process and source / target data combinations
[Figure: pipeline Data processing → Training datasets → Training → M → Inference/generation → Model outputs, with stages labelled Source, Process and Target]

6
Contributions
✓ Analysis of over 500 data science pipelines
  - "in the wild" → Kaggle
  - "controlled" → ML Bazaar
✓ Formal provenance semantics for a catalogue of commonly used data science operators
✓ Data Provenance for Data Science (DPDS)
  - automatically tracks granular provenance from Pandas
  - maximally transparent and minimally intrusive to the programmer
✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines

7
Data processing pipelines analysis: ML Bazaar
✓ Facilitates developing ML and AutoML systems
✓ Workflow style: pipelines composed out of pre-defined primitives
✓ Data + task pairs with benchmark results over multiple data types
✗ Only 5 types of operators
✗ Single location, controlled ecosystem
DFS = Deep Feature Synthesis

8
Data processing pipelines analysis: Kaggle
Scope: top 200 most upvoted Python notebooks related to machine learning on Kaggle
• 29 unique pre-processing operations
• 12 appear in fewer than 10 pipelines
  - Transposing
  - Changing index values
• Feature augmentation (58)
• Scaling operations (38)

9
Data processing operators

10
Data reduction
Projection: $D' = \pi_C(D)$
Selection: $D' = \sigma_C(D)$
Example: $\pi_{\{\mathrm{Cid},\mathrm{Gender},\mathrm{Age}\}}(\sigma_{\mathrm{Age}<30}(D))$
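The example expression on this slide (select the rows with Age < 30, then keep Cid, Gender and Age) can be mirrored in Pandas. This is a minimal sketch: the toy dataframe and its values are invented for illustration, and the extra Zip column is an assumption added only so that the projection has something to drop.

```python
import pandas as pd

# Toy data, invented for illustration.
D = pd.DataFrame({
    "Cid": [1, 2, 3],
    "Gender": ["F", "M", "F"],
    "Age": [25, 41, 29],
    "Zip": ["00146", "NE1", "B15"],
})

selected = D[D["Age"] < 30]                    # selection on Age < 30
reduced = selected[["Cid", "Gender", "Age"]]   # projection on Cid, Gender, Age

print(reduced)
```

Note that Pandas keeps the original row labels after the selection, which is convenient when each surviving row has to be linked back to the input row it came from.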

11
Data augmentation
Vertical augmentation: $\alpha^{\rightarrow}_{f(X):Y}(D)$
Example: $\alpha^{\rightarrow}_{f_1(\mathrm{Age}):\mathrm{ageRange}}(D)$
Horizontal augmentation: $\alpha^{\downarrow}_{X:f(Y)}(D)$
Example: $E_2 = \alpha^{\downarrow}_{\mathrm{Gender}:f_2(\mathrm{Age})}(D)$ (group by Gender, avg(Age))

12
Data transformation
$\tau_{f(X)}(D)$: the transformation of a set of features $X$ of $D$ using a function $f$ is obtained by substituting each value $d_{ia}$ with $f(d_{*a})$, for each feature $a$ occurring in $X$.
Example: data imputation. Here $f$ replaces nulls with the most frequent value, for column Zip:
$\tau_{f(\mathrm{Zip})}(D)$
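The imputation example can be sketched in Pandas as follows. The dataframe and its values are invented for illustration; `mode()` ignores nulls by default, so its first entry is the most frequent non-null value of the column.

```python
import pandas as pd

# Toy data, invented for illustration: one missing Zip value.
D = pd.DataFrame({"Cid": [1, 2, 3, 4], "Zip": ["NE1", None, "NE1", "B15"]})

# f: replace nulls in Zip with the column's most frequent value.
most_frequent = D["Zip"].mode().iloc[0]
D_prime = D.assign(Zip=D["Zip"].fillna(most_frequent))

print(D_prime)
```

Using `assign` leaves the input dataframe untouched, so both the old and the new value of each cell remain available for comparison.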

13
Data fusion: join and append
Join: $D^L \bowtie^{t}_{C} D^R$
Append: $D^L \uplus D^R$
Example: $D^L \bowtie^{\mathrm{inner}}_{D^L.\mathrm{CId} = D^R.\mathrm{CId}} D^R$

14
Conceptual provenance capture model: templates
$\alpha^{\rightarrow}_{f_1(\mathrm{Age}):\mathrm{ageRange}}(D)$
A different provenance template $pt_\tau$ is associated with each type $\tau$ of operator.

15
Capturing provenance: bindings
At runtime, when an operator o of type $\tau$ is executed, the appropriate template $pt_\tau$ for $\tau$ is selected.
Data items from the inputs and outputs of the operator are used to bind the variables in the template.
op: {old values: F, I, V} → {new values: F′, J, V′}
+
Binding rules
For $i = 1 \ldots n$:
used ent.: $[\langle F = X_m,\ I = i,\ V = D_{i,X_m} \rangle \mid X_m \in X]$
generated ent.: $[\langle F' = Y_h,\ J = i,\ V' = f(D_{i,X}) \rangle \mid Y_h \in Y]$
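The binding rules can be read as two list comprehensions per row. The sketch below is a plain-Python illustration under stated assumptions: rows are dictionaries, the index i is 0-based rather than 1-based, and the generated value is read from the output data instead of recomputing f. The function name `bind_transformation` is hypothetical.

```python
def bind_transformation(d_in, d_out, X, Y):
    """Instantiate used/generated bindings for every row index i.

    d_in, d_out: lists of row dictionaries of equal length (an assumption).
    X: input features read by the operator; Y: output features it writes.
    """
    bindings = []
    for i in range(len(d_in)):
        used = [{"F": x, "I": i, "V": d_in[i][x]} for x in X]
        # The new value is taken from the output row, i.e. f was already applied.
        generated = [{"F'": y, "J": i, "V'": d_out[i][y]} for y in Y]
        bindings.append({"used": used, "generated": generated})
    return bindings


# Toy imputation on column Zip, values invented for illustration.
rows_in = [{"Zip": None}, {"Zip": "NE1"}]
rows_out = [{"Zip": "NE1"}, {"Zip": "NE1"}]
bound = bind_transformation(rows_in, rows_out, X=["Zip"], Y=["Zip"])
print(bound[0])
```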

16
Implementation by shape and value diff
Shape changes:
[Figure: decision tree. Tests: rows added? rows removed? columns added? columns removed? Templates at the leaves: horizontal augmentation, reduction by selection, reduction by projection, data transformation (composite), data transformation]
Value changes for each column:
[Figure: decision tree. Tests: nulls reduced? values changed? Templates at the leaves: data transformation (imputation), data transformation, 1-1 derivations]
For each input/output pair Din, Dout of dataframes:
1. Diff both shapes and values of Din, Dout
2. Use the diff to:
   • Select the appropriate template
   • Bind the template variables using the relevant values in the two dataframes
   • Generate an instantiated provlet
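The diff-then-select procedure can be sketched in Pandas. This is not the DPDS implementation: the function names are hypothetical, the index is assumed to be unique and shared between the two dataframes, and the mapping in `select_template` is a simplified guess at the decision trees, which are not fully recoverable from the slide.

```python
import pandas as pd


def diff_frames(d_in, d_out):
    """Compare shapes and values of two dataframes (unique, shared index assumed)."""
    shape = {
        "rows_added": len(d_out.index.difference(d_in.index)) > 0,
        "rows_removed": len(d_in.index.difference(d_out.index)) > 0,
        "columns_added": len(d_out.columns.difference(d_in.columns)) > 0,
        "columns_removed": len(d_in.columns.difference(d_out.columns)) > 0,
    }
    values = {}
    rows = d_in.index.intersection(d_out.index)
    for col in d_in.columns.intersection(d_out.columns):
        before, after = d_in.loc[rows, col], d_out.loc[rows, col]
        both_null = before.isna() & after.isna()  # null vs null is not a change
        values[col] = {
            "nulls_reduced": bool(after.isna().sum() < before.isna().sum()),
            "values_changed": bool(((before != after) & ~both_null).any()),
        }
    return shape, values


def select_template(shape, values):
    """Illustrative mapping only (an assumption, not the slide's exact tree)."""
    if shape["rows_added"]:
        return "horizontal augmentation"
    if shape["rows_removed"]:
        return "reduction by selection"
    if shape["columns_added"] and shape["columns_removed"]:
        return "data transformation (composite)"
    if shape["columns_added"]:
        return "vertical augmentation"
    if shape["columns_removed"]:
        return "reduction by projection"
    if any(v["nulls_reduced"] for v in values.values()):
        return "data transformation (imputation)"
    if any(v["values_changed"] for v in values.values()):
        return "data transformation"
    return "1-1 derivations"


# Toy data, invented for illustration.
d_in = pd.DataFrame({"K": [1, 2], "E": [None, "Ex"]})
d_imputed = d_in.fillna("imputed")
d_projected = d_in.drop(columns=["E"])

print(select_template(*diff_frames(d_in, d_imputed)))
print(select_template(*diff_frames(d_in, d_projected)))
```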

17
Running Example
[Figure: pipeline Da, Db → Left join (K1,K2) → D1 → Impute all missing → D2; D2, Dc → Left join (K1,K2) → D3 → Impute E,F → D4 → Add 'E4', 'Ex', 'E1' → D5 → Remove 'E' → D6]
$D_1 = D_a \bowtie^{\mathrm{left}}_{K1,K2} D_b$
$D_2 = \tau_{f_1(*)}(D_1)$
$D_3 = D_2 \bowtie^{\mathrm{left}}_{K1,K2} D_c$
$D_4 = \tau_{f_2(E,F)}(D_3)$
$D_5 = \alpha^{\rightarrow}_{h(E):\{\mathrm{E4},\mathrm{Ex},\mathrm{E1}\}}(D_4)$
$D_6 = \pi_{\{\mathrm{Ax},B,\mathrm{Ay},D,C,F,\mathrm{E4},\mathrm{Ex},\mathrm{E1}\}}(D_5)$

18
Running Example
import pandas as pd

df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
df = df.fillna('imputed')  # imputation of all missing values
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
# one-hot encoding of column E
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)  # add the indicator columns
df = df.drop([c], axis=1)  # remove the original column E

19
Running Example
Dataframes | Diff | Template
D1 ← {Da, Db} | | Explicit join provenance pattern
D2 ← D1 | value change, reduced nulls → imputation | Data transformation
D3 ← {D2, Dc} | | Explicit join provenance pattern
D4 ← D3 | value change, reduced nulls → imputation | Data transformation
D5 ← D4 | Shape change, column(s) added | <wait!>
D6 ← D5 | Shape change, column(s) removed | Data transformation, composite

20
Program level transparency with control
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to application
- some control over Tracker surfaced
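One way to picture the observer idea is a proxy that forwards method calls and records each call that yields a new object of the same type. The sketch below is purely illustrative: `TrackedProxy` and its `active` switch are hypothetical names and do not describe the DPDS Tracker API. A `str` stands in for a dataframe so that the sketch runs without Pandas.

```python
class TrackedProxy:
    """Hypothetical proxy that records method calls returning a new value
    of the same type as the wrapped one (e.g. DataFrame -> DataFrame)."""

    def __init__(self, value, log, active=True):
        self._value = value
        self._log = log        # shared list of (method, before, after)
        self.active = active   # switch surfaced to the programmer

    def __getattr__(self, name):
        # Only invoked for attributes not found on the proxy itself.
        attr = getattr(self._value, name)
        if not callable(attr):
            return attr

        def wrapped(*args, **kwargs):
            result = attr(*args, **kwargs)
            if not isinstance(result, type(self._value)):
                return result
            if self.active:
                self._log.append((name, self._value, result))
            return TrackedProxy(result, self._log, self.active)

        return wrapped


log = []
text = TrackedProxy("  Hello ", log)   # a str stands in for a dataframe
result = text.strip().lower()
print([step[0] for step in log])
```

Each logged (before, after) pair is exactly what a shape and value diff needs as input. A proxy of this kind does not see operators such as indexing, or module-level functions such as `pd.merge`, so a real tracker needs additional hooks for those.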

21
Provenance traversals – example
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
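A backward traversal over element-level derivations can be sketched as a breadth-first search. The in-memory dictionary below stands in for a graph store, and the dataframe names, row numbers and edges are invented for illustration; elements are (dataframe, row, column) triples.

```python
from collections import deque

# Hypothetical element-level derivations for a join followed by fillna:
# each key was derived from the elements listed as its value.
was_derived_from = {
    ("imputed", 0, "E"): [("joined", 0, "E")],
    ("joined", 0, "E"): [("df_B", 0, "E")],
    ("joined", 0, "A"): [("df_A", 0, "A")],
}


def trace_back(element, derivations):
    """Return every element reachable from `element` via derivation edges."""
    seen, queue = set(), deque([element])
    while queue:
        current = queue.popleft()
        for source in derivations.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


print(trace_back(("imputed", 0, "E"), was_derived_from))
```

The `seen` set keeps the search linear in the number of edges even when an element is reachable along several paths, as happens after joins.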
[Figure: provenance traversal example over dataframes df_A (df_-1), df_B (df_0) and df_1, connected by Join and fillna]

22
Benchmarking: data x pipelines
Datasets:
Pipelines:
Provenance graphs are stored in a single Neo4j database

23
Results
The PT/PO ratio provides a rough indication of scalability:
- The graphs for the complete pipelines are close in size to the sum of the sizes of the components' graphs
1, 2, 3: pipeline number

24
Conclusions
✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
✓ A potentially useful building block towards explanations in a data-centric AI setting
Limitations:
• No granularity control → limited scalability
• Operates only on Pandas dataframes