(Explainable) Data-Centric AI: what are you explaining, and to whom?

pmissier, 25 slides, May 15, 2024

About This Presentation

A talk given to the AIM Research Support Facility @ the Turing Institute


Slide Content

Slide 1: (Explainable) Data-Centric AI
Prof. Paolo Missier
School of Computer Science, University of Birmingham, UK
May 2024
A talk given to the AIM Research Support Facility @ the Turing Institute
My contacts:
Slide 2: Data-centric AI
End-to-end processing from data sources to model outputs.
Credit: Andrew Ng, Landing.ai
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.

Slide 3: Outline
- Model training and data interventions are becoming entangled. This is good! But:
- Model-based explanations and data-based explanations should merge, too
  - Data explanations → data provenance??
  - Contextual explanations: the case of healthcare
- Some ideas on how this can be achieved

Slide 4: DCAI involves extended feedback loops
[5] Singh, Prerna. ‘Systematic Review of Data-Centric Approaches in Artificial Intelligence and Machine Learning’. Data Science and Management 6, no. 3 (1 September 2023): 144–57. https://doi.org/10.1016/j.dsm.2023.06.001.

Slide 5: Rapidly emerging literature
[2] Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI. Commun. ACM 66, 8 (August 2023), 84–92. https://doi.org/10.1145/3571724
[3] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, and Xia Hu. ‘Data-Centric AI: Perspectives and Challenges’. arXiv, 2 April 2023. http://arxiv.org/abs/2301.04819.
[4] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. ‘Data-Centric Artificial Intelligence: A Survey’. arXiv, 11 June 2023. https://doi.org/10.48550/arXiv.2303.10158.
Figures from [2] and [3].

Slide 6: Example: incremental label correction
Aim: to develop data performance benchmarks for ML, complementing the MLPerf benchmarks; both are part of MLCommons.
https://www.dataperf.org/
Benchmarks emerge through challenges that demonstrate how model performance can be enhanced through data interventions.
[10] Mazumder, Mark, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, et al. ‘DataPerf: Benchmarks for Data-Centric AI Development’. arXiv, 13 October 2023. https://doi.org/10.48550/arXiv.2207.10062.

Slide 7: Correcting mislabelling: the challenge
Context: a vision dataset for image classification tasks.
Given: a training set Dtr of labelled images and a classification task T. Training data come from the Open Images V7 dataset; images are annotated with image-level labels, object bounding boxes, object segmentation masks, …
Scenario: realistically, some of the labels in Dtr are noisy.
Challenge: suggest a strategy that minimises the number of label-fixing (“cleaning”) actions needed to achieve a target performance relative to
P = Perf(M(Dtr))
the best performance, obtained when model M is trained on a perfectly labelled Dtr and evaluated on an independent test set Dtest.
https://www.dataperf.org/training-set-acquisition/acquisition-overview
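To make the target concrete, here is a minimal sketch of how the reference score P might be computed. A scikit-learn classifier stands in for the actual vision model M, and the perf helper name is illustrative, not part of the challenge code:

# Reference score P = Perf(M(Dtr)): train M on Dtr, evaluate on Dtest.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def perf(X_tr, y_tr, X_te, y_te):
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # M trained on Dtr
    return accuracy_score(y_te, model.predict(X_te))           # evaluated on Dtest

# target = perf(X_tr, y_clean, X_te, y_te)   # P, with perfectly labelled Dtr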

Slide 8: Data cleaning simulation pattern
[Diagram: on the evaluator side, Dtr is corrupted into a noisy Dn, and fixed training code trains and evaluates a model on the clean Dtr to record the target score; on the competitor side, a cleaning-priority strategy turns Dn into D', which is fed to the same fixed training code for model training and evaluation.]
- A noisy version Dn is generated from Dtr (e.g. by label flipping).
- The target performance is recorded by training on Dtr and testing on Dtest.
- Labelling strategy: a ranking of the examples in Dn to be cleaned, to achieve performance close to P.
- Strategies are scored on the number of cleaning actions required to achieve 95% of the target performance.
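A sketch of this simulation loop, reusing the perf helper above and assuming binary labels; flip_labels, score_strategy and their signatures are illustrative names, not part of the DataPerf API:

import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y, frac=0.2):
    """Evaluator side: corrupt Dtr into Dn by flipping a fraction of binary labels."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy

def score_strategy(ranking, X_tr, y_clean, y_noisy, X_te, y_te, target):
    """Count the cleaning actions needed to reach 95% of the target score."""
    y = y_noisy.copy()
    for n_fixed, j in enumerate(ranking, start=1):
        y[j] = y_clean[j]                               # one cleaning action
        if perf(X_tr, y, X_te, y_te) >= 0.95 * target:
            return n_fixed                              # fewer is better
    return len(ranking)

A competitor submits the ranking; the evaluator supplies everything else and keeps the training code fixed so that only the data intervention is scored.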

Slide 9: Data-X: explaining the data side of ML/AI
[Diagram: source datasets → data processing → training datasets → training → model outputs; Data-X covers the data side of the pipeline, Model-X the model side.]
Data explanation questions:
- How was a dataset processed, step by step?
- From the whole dataset down to individual items: which ones?
- Why was a specific data item transformed?
Data-Centric AI involves complex data transformations and filtering:
- model-driven data cleaning
- model-driven training set optimization
- …

Slide 10: In our running example…
[Diagram: the MLOps loop in which CSi cleans Dn into Di, which is used for model training and evaluation, eventually yielding Mbest. CSi = version i of some cleaning strategy.]
We would like to:
1. Document that Di was derived from Dn using CSi, as part of a longer pipeline.
2. Be able to identify what effect CSi had on Dn:
   - which data labels were cleaned
   - why they were cleaned
3. Make sure CSi can be reused safely:
   - specify assumptions and pre-requisites
   - provide examples of past usages

Slide 11: Mission: make new data-centric algorithms explainable, reusable
[Diagram: problem instances are observed / recorded into a provenance database (Prov-DB), which supports reproducing / explaining pipeline runs and feeds a curated data toolkit that enables reuse across data and training operations.]
Goals, to support:
- reusability and emerging best practices for complex data intervention + usage patterns
- reproducibility and explainability of pipeline instances
How:
- enable data processing observation / capture
- build a curated catalogue of interventions + usage patterns
Challenges:
- Observability: instrumenting a common runtime for transparent capture
- Granularity: explanations need to be pitched at the right level for different stakeholders

Slide 12: Representing provenance
W3C PROV: a formal, interoperable data model and syntax for generic provenance constructs, extensible to domain vocabularies.

Slide 13: Provenance layer I: whole dataset
Assumptions:
- Dn, Di are atomic units of data
- CS is the atomic unit of processing
Reproducibility: “outer layer” questions:
- Where does Di come from?
- Which version Di was used to train Mbest?
Derivation:
- Di was derived from Dn using CSi
- Mbest was trained on Di
Attribution:
- CSi was created by <Actor A>
[PROV graph: CSi used Dn; Di wasGeneratedBy CSi; Di wasDerivedFrom Dn; CSi wasAssociatedWith agent A; earlier versions such as Di-1, produced by activity Ai-1, are linked by further used / wasGeneratedBy / wasDerivedFrom edges.]
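A minimal sketch of this outer layer using the Python prov package; the ex: identifiers are made up for illustration:

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

dn = doc.entity('ex:Dn')          # noisy training set
di = doc.entity('ex:Di')          # cleaned training set
cs = doc.activity('ex:CSi')       # cleaning strategy, version i
a  = doc.agent('ex:ActorA')

doc.used(cs, dn)                  # CSi used Dn
doc.wasGeneratedBy(di, cs)        # Di wasGeneratedBy CSi
doc.wasDerivedFrom(di, dn)        # Di wasDerivedFrom Dn
doc.wasAssociatedWith(cs, a)      # attribution: CSi created/run by Actor A

print(doc.get_provn())            # serialise as PROV-N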

Slide 14: Provenance layer II: data-granular provenance
Assumptions:
- Dn = {xnj}, Di = {xij}
- CS is the atomic unit of processing
Explainability: data-level questions:
- Which xnj were cleaned?
- “How dirty was Dn?” In aggregate: how many labels were cleaned to achieve a target performance?
Item-level derivations, for each xij that has been cleaned by CSi:
- xij was derived from xnj
[PROV graph: CS used xnj; xij wasGeneratedBy CS; xij wasDerivedFrom xnj; CS wasAssociatedWith agent C.]
Internal representation:
- property-value graphs
- Neo4j
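For instance, a hypothetical Cypher query over such a Neo4j property graph answers “which labels were cleaned?”; the node label Entity and the dataset / idx properties are assumptions for illustration, not the actual schema:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CLEANED = """
MATCH (xi:Entity {dataset: 'Di'})-[:wasDerivedFrom]->(xn:Entity {dataset: 'Dn'})
RETURN xn.idx AS cleaned_item
"""

with driver.session() as session:
    cleaned = [rec["cleaned_item"] for rec in session.run(CLEANED)]
print(f"{len(cleaned)} labels were cleaned")   # aggregate: how dirty was Dn?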

Slide 15: How can we generate these provenance graphs?
Key idea: an interpreter-level observer:
- requires an observer at the boundaries of each processor
- the observer has access to individual data items
[PROV graph: Op used xj; xij wasGeneratedBy Op; xij wasDerivedFrom xj; Op wasAssociatedWith an actor.]
A starting point: Data Provenance for Data Science (DPDS).
Gregori, Luca, Paolo Missier, Matthew Stidolph, Riccardo Torlone, and Alessandro Wood. Design and Development of a Provenance Capture Platform for Data Science. In Procs. 3rd DATAPLAT Workshop, co-located with ICDE 2024. Utrecht, NL: IEEE, 2024.
A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12 (2022), 3614–3617. https://doi.org/10.14778/3554821.3554857
Adriane Chapman, Luca Lauro, Paolo Missier, and Riccardo Torlone. 2024. Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Trans. Database Syst. Just Accepted (February 2024). https://doi.org/10.1145/3644385

Slide 16: Capturing provenance: sketch
Typical operator implementation:
- Pandas / Spark Python pipelines over Dataframe datasets
- CS can be a method call or a code block:
1. Method call:
   D' = Op(D)
2. Code block, delimited by markers:
   D → "Begin Op" … (code) … "End Op" → Di
[Diagram, Layer I (coarse), process-level observer: Op used D; D' wasGeneratedBy Op; D' wasDerivedFrom D; downstream, training used D' and M wasGeneratedBy the training step in the MLOps loop.]
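A minimal sketch of a process-level (Layer I) observer for the method-call case, assuming operators are functions over pandas dataframes; PROV_LOG stands in for a real provenance store such as the Prov-DB:

import functools
import pandas as pd

PROV_LOG = []   # stand-in for a provenance store

def observed(op):
    """Wrap an operator so that D' = Op(D) also records used / wasGeneratedBy."""
    @functools.wraps(op)
    def wrapper(df, *args, **kwargs):
        out = op(df, *args, **kwargs)
        PROV_LOG.append({"activity": op.__name__,   # Op
                         "used": id(df),            # D
                         "generated": id(out)})     # D', wasDerivedFrom D
        return out
    return wrapper

@observed
def impute_all(df):
    return df.fillna("imputed")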

Slide 17: Pipeline example
[Diagram: Da and Db are left-joined on (K1, K2) giving D1; imputing all missing values gives D2; a second left join with Dc on (K1, K2) gives D3; imputing E and F gives D4; one-hot encoding adds columns 'E4', 'Ex', 'E1' and removes 'E', giving D5 and D6.]

import pandas as pd

df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')   # join
df = df.fillna('imputed')                                    # imputation
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')     # join
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})                 # imputation
# one-hot encoding of column E
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)
df = df.drop([c], axis=1)                                    # drop the encoded column

Slide 18: Minimal code instrumentation
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to the application
- some control over the Tracker is surfaced to the user
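One way such a mostly-transparent observer could look, as a sketch: a wrapper that forwards attribute access to the underlying pandas dataframe and logs every call that produces a new dataframe. This is illustrative only, not the actual DPDS Tracker API:

import pandas as pd

class TrackedFrame:
    """Forwards to a wrapped DataFrame, logging dataframe-producing calls."""
    def __init__(self, df, log):
        self._df, self._log = df, log

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if not callable(attr):
            return attr
        def tracked(*args, **kwargs):
            out = attr(*args, **kwargs)
            if isinstance(out, pd.DataFrame):      # the dataframe changed
                self._log.append(name)             # record the operation
                return TrackedFrame(out, self._log)
            return out
        return tracked

log = []
df = TrackedFrame(pd.DataFrame({"E": [1, None]}), log)
df = df.fillna(0)        # application code unchanged
print(log)               # ['fillna']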

Slide 19: Provenance traversals: example
Capture, store and query element-level provenance:
- derivation of each element of each intermediate dataframe (when possible)
- efficiently, at scale
[Diagram: element-level derivations of df_1, traced back through the join of df_A (df_-1) and df_B (df_0) and a subsequent fillna.]
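A hypothetical backward traversal over the same assumed Neo4j schema as above: starting from one element of df_1, follow wasDerivedFrom edges back to its source elements in df_A / df_B (driver as set up in the earlier sketch):

LINEAGE = """
MATCH (e:Entity {dataframe: 'df_1', row: $row, col: $col})
      -[:wasDerivedFrom*]->(src:Entity)
WHERE NOT (src)-[:wasDerivedFrom]->()
RETURN src.dataframe, src.row, src.col
"""

with driver.session() as session:
    for rec in session.run(LINEAGE, row=3, col="E"):
        print(rec["src.dataframe"], rec["src.row"], rec["src.col"])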

Slide 20: Explain what? End-to-end
[Diagram: source datasets → data processing → training datasets → training → model outputs, with Data-X covering the data side and Model-X the model side.]
Incorrect and correct predictions, and the examples selected to explain them.
Provenance: where do these examples come from? E.g., explain the data augmentation strategy.
Lin, Jinkun, Anqi Zhang, Mathias Lécuyer, Jinyang Li, Aurojit Panda, and Siddhartha Sen. ‘Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments’. In Proceedings of the 39th International Conference on Machine Learning, 13468–504. PMLR, 2022. https://proceedings.mlr.press/v162/lin22h.html.

Slide 21: Explain to whom?
Explanations should be contextualised / customised / adapted for different stakeholders.
Example: healthcare. Two broad categories of stakeholders:
1. “AI clients”:
   - health care professionals: GPs, specialists, nurses, administrators, policy makers, regulators
   - patients and the public
   - involved in the co-design of AI-based solutions, but not primarily involved in AI development and validation
2. Data and AI specialists: data controllers, data scientists, AI experts.
Responsible AI:
- the two categories should work together (co-design)
- each stakeholder will have a different role as part of the “DCAI loop”

Slide 22: Model-X and Data-X differ across stakeholders
AI clients may hold specialists accountable for the trustworthiness of the final product:
- Health care professionals expect quantified confidence in the model output: “epistemic humility”.
- Regulators and policy makers expect evidence of model fairness.
- Patients and the public may accept trust by transitivity (e.g. “I trust my doctor, who trusts the system”).

Slide 23: Mapping the XDCAI space
[Matrix: “What are you explaining?” (source datasets, data processing, training datasets, training, model outputs) against “To whom?” (data controllers, data scientists, AI developers; clinical professionals such as doctors and nurses; health admin; patients & public; regulators). Example techniques and questions per cell: LIME, SHAP, occlusion testing; influence functions, subgroup testing; “How has the training set been produced?”; “Why has data point X in the input been affected?”; contextualised end-to-end explanations.]
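On the Model-X side of this map, a minimal example of what a SHAP explanation looks like in code; model and X are assumed to be an already-fitted model and its feature matrix:

import shap

explainer = shap.Explainer(model)      # e.g. a fitted tree or sklearn model
shap_values = explainer(X)             # per-feature attributions per prediction
shap.plots.bar(shap_values)            # global feature-importance summary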

Slide 24: The bigger picture
[Diagram: the end-to-end pipeline (source datasets → data processing → training datasets → training → model M → inference/generation) is observed & recorded, with DPDS as a starting point; model explanations and contextualised explanations are socio-technically co-designed between clients and specialists.]

Slide 25: Summary
[Diagram: XDCAI = data interventions + model training, with Data-X and Model-X as the two sides of explanation.]
Goals: make complex data interventions safely reusable and explainable:
- demonstrate Data-X using layered provenance
- combine Model-X and Data-X
- support contextualised explanations