02 - Data Validation and Validity



Software Testing and Engineering for AI Systems (DSAIT4015)
Lecturers: Cynthia Liem and Annibale Panichella

Logistics
• Please enroll your project group (4 people) on Brightspace
• Let us know if we need to instantiate more groups

Data Validation and Validity
Lecturer: Cynthia Liem

What Can Go Wrong With Data?
I will not speak so much about database schema violations as about gaps between data and human interpretation.

[Diagram: a typical ML pipeline — data is pre-processed and labeled, ML training yields an optimized model, and the model is deployed in an application]

[Diagram: the same pipeline placed in context — it starts from a problem and ends in decisions]

Is This a Dog?
[Image examples by Leonhard Applis, including a wolf pup and Goofy]
Sources: https://nl.pinterest.com/pin/806003664560130745/, https://www.istockphoto.com/nl/foto/wolf-pup-gm474625522-64803037, https://disney.fandom.com/wiki/Goofy

Oracle Issues in Machine Learning and Where To Find Them
Cynthia C. S. Liem and Annibale Panichella
https://dl.acm.org/doi/abs/10.1145/3387940.3391490

Use Case: Visual Object Recognition
[News coverage: Technology Review, 2014; Quartz, 2017]

Visual Object Recognition
• Standardization of image dimensions
• [R,G,B] pixel intensities for x
• y: vector of ground truth class probabilities, used for maximum likelihood optimization, e.g.
  P(y = goldfish) = 0.0, P(y = beagle) = 1.0, P(y = volcano) = 0.0, P(y = shower curtain) = 0.0
• Models will output an estimated probability vector ŷ = f(x)

Looking at ŷ
[Figure: two bar charts over the classes Goldfish, Beagle, Volcano, Shower Curtain — left: "Confident prediction" (nearly all probability mass on one class); right: "Prediction not clear-cut" (probability mass spread over several classes)]

Shannon Entropy
$H(\hat{y}) = -\sum_i P(y = i \mid x) \log_2 P(y = i \mid x)$
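To make the formula concrete, here is a minimal sketch (not from the slides) that computes the entropy of the two kinds of predictions shown above:

```python
import numpy as np

def shannon_entropy(y_hat):
    """H(y_hat) = -sum_i p_i * log2(p_i), ignoring zero-probability classes."""
    p = np.asarray(y_hat, dtype=float)
    p = p[p > 0]  # 0 * log2(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

# Hypothetical prediction vectors over (goldfish, beagle, volcano, shower curtain)
confident = [0.0, 1.0, 0.0, 0.0]
unclear = [0.25, 0.35, 0.15, 0.25]
print(shannon_entropy(confident))  # 0.0   -> low entropy
print(shannon_entropy(unclear))    # ~1.94 -> high entropy
```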

Looking at ŷ
[Figure: the same two bar charts, now labeled by their entropy — the confident prediction has low entropy, the spread-out prediction has high entropy]

Semantic Information
• Labels in object recognition are not independent
• Pictures can contain multiple objects
• Semantic relations
Figure from https://openscience.com/wordnet-open-access-data-in-linguistics/

WordNet
https://wordnet.princeton.edu
• Synonyms: pairs of class labels that have the same meaning
• Homonyms: pairs of class labels that are spelled and pronounced the same, but that have different meanings
• Meronyms: pairs of class labels linked by a part-of relation

ImageNet
• Large-scale hierarchical image database
• Crowdsourced annotation: Is there an [X] in this image?
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012
  • ‘The’ object recognition benchmark challenge
  • New models benchmarked ‘on ImageNet’:
    • Trained on ILSVRC2012 training set
    • Evaluated on ILSVRC2012 validation set (top-1 and top-5 accuracy)
• Well-performing model weights often released for use in transfer learning
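The top-1 and top-5 accuracy mentioned above can be computed in a few lines of numpy; a minimal sketch (not from the slides):

```python
import numpy as np

def topk_accuracy(y_true, y_prob, k=5):
    """Fraction of samples whose ground-truth class index is among
    the k classes with the highest predicted probability."""
    topk = np.argsort(y_prob, axis=1)[:, -k:]
    hits = [truth in row for truth, row in zip(y_true, topk)]
    return float(np.mean(hits))

# Tiny hypothetical example: 2 samples, 4 classes
y_true = np.array([1, 3])
y_prob = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.4, 0.3, 0.2, 0.1]])
print(topk_accuracy(y_true, y_prob, k=1))  # 0.5
print(topk_accuracy(y_true, y_prob, k=3))  # 0.5 (class 3 never makes the top-3)
```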

Setup
• 4 ‘classical’ deep architectures (VGG16, VGG19, ResNet50, ResNet101)
• pre-trained on ILSVRC2012, weights released through Keras
• predictions run on all 50,000 ILSVRC2012 validation images
• application of original pre-processing methods
• we use our heuristics to surface striking outliers

Top-5 accuracy per model: VGG16: 90.0%, VGG19: 90.1%, ResNet50: 92.1%, ResNet101: 92.8%
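For reference, a minimal sketch of this prediction setup for a single image (not the authors' exact code; assumes TensorFlow's bundled Keras, and "some_image.jpg" is a placeholder path):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import (
    VGG16, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")  # ILSVRC2012-pretrained weights via Keras

# Standardize dimensions and apply the model's original pre-processing
img = image.load_img("some_image.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

y_hat = model.predict(x)                    # estimated probability vector over 1000 classes
print(decode_predictions(y_hat, top=5)[0])  # top-5 (class id, name, probability)
```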

High Entropy: Kneepad
• None of the models recognized the ground truth class in the top-5
• All models consistently showed high entropy in ŷ
• Due to standardization, only the middle part of the image is offered for prediction

Low Entropy: Bucket
• None of the models recognized the ground truth class in the top-5
• All models were consistently convinced (class probability of 1.0) that this image should be labeled as baseball
• Shortcoming of single-class labeling

Synonyms: Laptop
• Frequent top-1 confusions between laptop and notebook
• Looking at class probabilities, models do not ‘see’ synonym classes as close together
Excerpt from "Oracle Issues in Machine Learning and Where to Find Them" (ICSEW’20, May 23–29, 2020, Seoul, Republic of Korea):

Figure 3: Top-5 classifications for velvet image ILSVRC2012_val_00000433 (original and cropped versions shown in the paper):
vgg16: wallet (0.4160), doormat (0.2878), purse (0.1625), pencil box (0.0482), mailbag (0.0204)
vgg19: doormat (0.3504), purse (0.2684), wallet (0.1115), pencil box (0.0934), mailbag (0.0402)
ResNet50: doormat (0.8952), pencil box (0.0293), purse (0.0206), chest (0.0082), mailbag (0.0054)
ResNet101: purse (0.7394), pencil box (0.0984), doormat (0.0975), backpack (0.0143), chest (0.0101)

Figure 4: Top-5 classifications for laptop image ILSVRC2012_val_00007373 (original and cropped):
vgg16: laptop (0.9592), notebook (0.0346), iPod (0.0024), hand-held computer (0.0011), modem (0.0007)
vgg19: laptop (0.9796), notebook (0.0191), iPod (0.0004), desktop computer (0.0002), space bar (0.0001)
ResNet50: laptop (0.9954), notebook (0.0042), space bar (0.0002), computer keyboard (0.0000), mouse (0.0000)
ResNet101: laptop (0.9984), notebook (0.0015), space bar (0.0000), mouse (0.0000), computer keyboard (0.0000)

Figure 5: Top-5 classifications for laptop image ILSVRC2012_val_00002580 (original and cropped):
vgg16: notebook (0.7222), laptop (0.1866), desktop computer (0.0244), space bar (0.0097), solar dish (0.0092)
vgg19: notebook (0.7327), laptop (0.1178), desktop computer (0.0459), space bar (0.0243), hand-held computer (0.0152)
ResNet50: notebook (0.7230), laptop (0.1689), desktop computer (0.0420), space bar (0.0239), mouse (0.0059)
ResNet101: notebook (0.8161), laptop (0.1492), modem (0.0100), space bar (0.0091), desktop computer (0.0041)
4.3 Good performance vs. visual understanding
Our analysis surfaces various oracle issues that globally hint at issues with label taxonomies and problems with data encoding and representation. Considering the original setup and context of the ILSVRC2012 data, as an academic benchmark focused on assessing the presence of certain object classes in images, this is not necessarily a problem. As we showed in the previous subsection, many ‘mistakes’ made by our examined models can be explained by a human and may not be true errors, rather signifying cases in which the oracle may need to be reinterpreted. However, given the interest in deploying well-performing models in real-world scenarios, we want to point out that there still are conceptual discrepancies between very good model performance based on the ILSVRC2012 data and true visual understanding for safety-critical applications. Models may exist that may yield even better performance than our currently examined models within the ILSVRC2012 context and its representation and evaluation framework, but that may never be acceptable in practical scenarios, e.g. in automated computer vision components for self-driving cars.

ILSVRC2012 is no balanced representation of the real world. Where ImageNet seeks to provide a comprehensive visual ontology, the ILSVRC2012 benchmark made particular benchmark-motivated choices in picking the classes to be recognized. For example, as ILSVRC2012 focused both on general and fine-grained classification, the latter was facilitated with more than 100 out of the 1000 object classes corresponding to sub-species of dogs (e.g. miniature poodle, standard poodle). However, it would be unrealistic to assume that over 10% of our real-world visual observations consider sub-species of dogs.

Image classes in ILSVRC2012 are not independent. However, in the way they are mathematically represented, it is implied they are. With only one ground truth label per image, mathematically, the ‘ideal’ y for a given image will be a one-hot encoded vector, with y_i = 1.0 for the i corresponding to the ground truth class, and y_i = 0.0 otherwise. In other words, classes are framed as independent. Thus, mathematically, a miniature poodle would be considered equally far away to a beer bottle as to a standard poodle.

Maximum likelihood criteria will nudge models towards treating the classes as independent. During the training of an ML classification pipeline, the common criterion to optimize for is the likelihood of the ground truth class, which should be maximized. With a single ground-truth label being available per image, the best result in terms of optimization therefore is to have a prediction confidence of 1.0 for a single class (and thus, a probability of 0.0 for other classes), even if multiple classes are present. Thus, while a beach wagon typically contains more than one car wheel, if the first class was the ground truth, optimization is considered to have succeeded better if an ML system classifies beach wagon with 1.0 confidence, thus being ‘blind’ to the possible presence of car wheels.

Traditional final success assessment ignores prediction confidence. As noticed before, traditional ILSVRC2012 evaluation only cares about the presence of the ground truth class in the top-1 or top-5: whether the predicted probability for a ground truth label is 1.0 or 0.1 does not matter, as long as the class is present.
[Figure: predicted probabilities for the notebook and laptop classes, per model (vgg16, vgg19, ResNet50, ResNet101)]

The World View Depicted in ILSVRC2012
• Not representative of the real world
• > 100 sub-species of dogs (5 cats)
• 1 red wine (no white wine)
• 1 Granny Smith (no other apples)
• 1 carbonara (no other pasta)

Carbonara
[Images contrasting "The ImageNet view: equivalent exemplars" with "The Italian view"]

ImageNet’s Origins
Shankar et al. - No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World
https://arxiv.org/abs/1711.08536

Cultural Concepts
[Image slides]

What Is the Representative Sample?
• Criticism in psychology: samples often drawn entirely from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies
• “our review of the comparative database from across the behavioral sciences suggests both that there is substantial variability in experimental results across populations and that WEIRD subjects are particularly unusual compared with the rest of the species”
• “The findings suggest that members of WEIRD societies, including young children, are among the least representative populations one could find for generalizing about humans.”
https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/weirdest-people-in-the-world/BF84F7517D56AFF7B7EB58411A554C17

Non-Visual (and Stereotypical) Concepts
Is there an [X] in this image?
Bad Person, Call Girl, Drug Addict, Closet Queen, Convict, Crazy, Failure, Flop, Fucker, Hypocrite, Jezebel, Kleptomaniac, Loser, Melancholic, Nonperson, Pervert, Prima Donna, Schizophrenic, Second-Rater, Spinster, Streetwalker, Stud, Tosser, Unskilled Person, Wanton, Waverer, and Wimp
Crawford & Paglen - excavating.ai

Validity

Psychometrics
• Measuring constructs, which are not directly observable
• The measurement instrument is also known as a psychological test (warning: vocabulary clash with software)

An Instrument Is Sound…
•…if it is valid
•…and if it is reliable

Reliability
•Internal consistency
•Test-retest reliability
•In case of subjective tests:
•Inter-rater reliability
•Intra-rater reliability

Validity vs. Reliability

Construct Validity
• Extent to which variables of an experiment correspond to the theoretical meaning of the concept they purport to measure.

A Famous Violation of Construct Validity

‘Horse’ Systems
• Do not actually address the problem they appear to be solving
• Only a ‘horse’ in relation to a specific problem
• Hence, a ‘horse’ for one problem may not be one for another:
  • ‘Reproduce ground truth by XYZ’ vs.
  • ‘Reproduce ground truth by any means’
Bob Sturm

Content Validity
• Extent to which the experimental units reflect and represent the elements of the domain under study.

Criterion Validity
• Extent to which results of an experiment are correlated with those of other experiments already known to be valid.
• Concurrent: how does a new test/measurement compare against a validated test/measurement?
• Predictive: how well does a test/measurement predict a future outcome?

This Is a Validated Instrument
•‘Big Five’ Personality
•Openness
•Conscientiousness
•Extraversion
•Agreeableness
•Neuroticism

This Is Not a Validated Instrument
•Myers-Briggs

This Is Not a Validated Instrument
Liem et al. - Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f

Consequences
Liem et al. - Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f

Another Problematic Example
Wu & Zhang - Automated Inference on Criminality using Face Images
https://arxiv.org/abs/1611.04135

Challenges of Multimedia Data
• Raw data: many numbers
  • 44.1 kHz audio: 44,100 measurements per second
  • Record me for 45 mins: 119,070,000 measurements
  • RGB images: 224 x 224 x 3 pixels = 150,528 intensity values
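These counts follow directly from the sampling rate and image dimensions; a one-line sanity check (illustrative only):

```python
# Sanity-checking the slide's arithmetic
assert 44_100 * 60 * 45 == 119_070_000   # 45 minutes of 44.1 kHz audio
assert 224 * 224 * 3 == 150_528          # RGB pixel intensities per image
print("both counts check out")
```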


Can’t Trust the Feeling? How Open Data Reveals Unexpected Behavior of High-Level Music Descriptors
Cynthia C. S. Liem and Chris Mostert
https://archives.ismir.net/ismir2020/paper/000137.pdf

Automatic Music Description
•Critical to content-based music information retrieval
•Only way for non-content owners to perform large-scale research
•Leading to Grander Statements on the Nature of Music

But Can We Trust the Descriptors?
•Successful performance reported in papers.
•How does this extend to ‘in-the-wild’ situations?

AcousticBrainz
• Community locally computes descriptor values, using the open-source Essentia library.
• Submissions (with metadata) collected per MusicBrainz Recording ID.
• High-level descriptors are machine learning-based, and include classifier confidence.

AcousticBrainz
• Anyone can submit anything… so we don’t know what the output should be?
• In psychology and software engineering, ‘testing’ can go beyond ‘known truths’, exploiting known relationships.

Multiple Recording Submissions
• Inspired by software testing (derived oracles / differential testing)
• If only the codec changes, songs remain semantically equivalent.
• One would assume (see the sketch below):
  classify_c(my_preprocessing(m)) == classify_c(your_preprocessing(m))
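A minimal sketch of that assumption as a metamorphic test (hypothetical names: classify_c stands in for an AcousticBrainz high-level classifier, and the preprocessing functions for different codec/encoding chains):

```python
def metamorphic_codec_check(m, classify_c, preprocessings):
    """If the encodings of m are semantically equivalent, the classifier
    should assign them all the same label."""
    labels = [classify_c(pre(m)) for pre in preprocessings]
    if any(label != labels[0] for label in labels):
        raise AssertionError(f"Equivalent encodings got different labels: {labels}")

# Usage (with stand-in functions):
# metamorphic_codec_check(recording, classify_c=genre_classifier,
#                         preprocessings=[encode_mp3, encode_ogg, encode_flac])
```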

Not Quite!

‘Constructs Known To Relate’
• Inspired by psychological testing (construct validity)
• Same input is run through multiple classifiers, targeting the same concept.

‘Constructs Known To Relate’
• genre_rosamerica classifier was 90.74% accurate on rock.
• genre_tzanetakis classifier was 60% accurate on rock.
• Pearson correlation between (genre_rosamerica, rock) and (genre_tzanetakis, rock) classifications in AcousticBrainz: -0.07
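Computing such a correlation is straightforward; a sketch (assumes scipy; the dummy arrays stand in for per-recording rock probabilities extracted from an AcousticBrainz dump):

```python
import numpy as np
from scipy.stats import pearsonr

# Dummy stand-ins: per-recording 'rock' outputs of the two classifiers
rosamerica_rock = np.random.rand(1000)
tzanetakis_rock = np.random.rand(1000)

r, p_value = pearsonr(rosamerica_rock, tzanetakis_rock)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# Two classifiers targeting the same construct should correlate positively;
# the paper instead found r = -0.07.
```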


Strange Confidence Distributions
• Peak vs. non-peak distributional differences are especially large for bit rate, codec, and low-level extractor software versions.
• We hardly consider these in high-level descriptor evaluation!

What Can We Do?

Better Articulation of Underlying Assumptions
• Are there any assumptions of underlying distributions, and are they actually met?
• What is ‘the universe’ that should be represented?

Better Awareness & Standards on Measurement and Annotation
• https://conjointly.com/kb/measurement-in-research/
• Aroyo & Welty - Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation, https://ojs.aaai.org/index.php/aimagazine/article/view/2564
• Welty, Paritosh & Aroyo - Metrology for AI: From Benchmarks to Instruments, https://arxiv.org/abs/1911.01875
• Jacobs and Wallach - Measurement and Fairness, https://dl.acm.org/doi/10.1145/3442188.3445901

Better Documentation
• Often inspired by data provenance in databases
• Complements to Data Protection Impact Assessments
• Gebru et al. - Datasheets for Datasets, https://arxiv.org/abs/1803.09010
• Jo & Gebru - Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning, https://dl.acm.org/doi/abs/10.1145/3351095.3372829

Stronger Requirements
• “The AI should classify images of dogs”

vs.
• “The system should return true for photographs containing household-dogs. Other similar species, such as wolves, should return false. Images that contain dogs, but other items as well, should return true.” (see the test sketch below)
• Ahmad et al. - What’s up with Requirements Engineering for Artificial Intelligence Systems?, https://raw.githubusercontent.com/nzjohng/publications/master/papers/re2021_1.pdf
• More in upcoming lectures
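The stronger requirement translates almost directly into executable acceptance tests; a hypothetical sketch (classify_dog and load_image are assumed stand-ins for the system under test and an image loader):

```python
# Hypothetical acceptance tests for the stronger requirement above;
# classify_dog(image) -> bool is the assumed system under test.
def test_household_dog_returns_true(classify_dog, load_image):
    assert classify_dog(load_image("household_dog.jpg")) is True

def test_similar_species_returns_false(classify_dog, load_image):
    assert classify_dog(load_image("wolf.jpg")) is False

def test_dog_among_other_items_returns_true(classify_dog, load_image):
    assert classify_dog(load_image("dog_with_furniture.jpg")) is True
```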

Automated Tooling
• Northcutt et al. - Confident Learning: Estimating Uncertainty in Dataset Labels, https://jair.org/index.php/jair/article/view/12125/26676 | https://github.com/cleanlab/cleanlab
• Breck et al. - Data Validation for Machine Learning, https://mlsys.org/Conferences/2019/doc/2019/167.pdf
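As an illustration of the cleanlab route (a sketch assuming cleanlab 2.x; pred_probs must be out-of-sample predicted probabilities, e.g. obtained via cross-validation):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Dummy stand-ins: 5 samples, 3 classes
labels = np.array([0, 1, 2, 1, 0])         # given (possibly noisy) labels
pred_probs = np.array([[0.9, 0.05, 0.05],  # out-of-sample predicted probabilities
                       [0.1, 0.8, 0.1],
                       [0.2, 0.7, 0.1],    # model disagrees with label 2 here
                       [0.1, 0.85, 0.05],
                       [0.8, 0.1, 0.1]])

issue_mask = find_label_issues(labels=labels, pred_probs=pred_probs)
print(np.where(issue_mask)[0])  # indices of likely label errors
```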

Be Aware of Researcher Degrees of Freedom
• We have some flexibility in data collection and analysis (e.g. choices of normalization, hyperparameters, etc.)
• This may actually affect results and final conclusions!

Simmons et al. - False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, https://journals.sagepub.com/doi/10.1177/0956797611417632
McFee et al. - Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research, https://sinc-lab.com/files/mcfee2019opensource.pdf
Kim et al. - Beyond Explicit Reports: Comparing Data-Driven Approaches to Studying Underlying Dimensions of Music Preference, https://dl.acm.org/doi/10.1145/3320435.3320462
Liem and Panichella - Run, Forest, Run? On Randomization and Reproducibility in Predictive Software Engineering, https://arxiv.org/abs/2012.08387

Further Translations of Testing Concepts?
•Software: Coverage? Input diversity? Edge cases?
•Psychology: Further equivalents to validity assessment?

Articulation of Desired Policy?
•To be discussed in upcoming lectures on fairness

For Now - Think of These Questions in Connection to the Assignment Dataset
• What would make for a ‘better’ or ‘worse’ dataset?
• If you could test this data more thoroughly, what would you test for?