GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory

Sevvandi Kandanaarachchi · Aug 23, 2024

About This Presentation

GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory


Slide Content

Algorithm Evaluation using Item Response Theory. Sevvandi Kandanaarachchi, work with Kate Smith-Miles. GECCO 2024. Reference: Kandanaarachchi, Sevvandi, and Kate Smith-Miles. "Comprehensive algorithm portfolio evaluation using item response theory." Journal of Machine Learning Research 24.177 (2023): 1-52.

To explain or to predict? – Galit Shmueli. Shmueli's 2010 paper discusses these two goals and argues that explanation and prediction are different, with distinct modelling paths for predicting and for explaining. The social sciences have built explanatory models for a long time.

What is an explanation? "To explain an event is to provide some information about its causal history." – Lewis, 1986 (Causal Explanation). "A statement or an account that makes something clear." – Google. "It is important to note that the solution to explainable AI is not just 'more AI'." – Miller, 2019. Miller (2019) argues for combining social science and computer science in XAI: "In the fields of philosophy, cognitive psychology/science, and social psychology, there is a vast and mature body of work that studies these exact topics."

Message: Bring in the social scientists to the party! Integrate their methods!

Hello, Item Response Theory.

Item Response Theory (IRT): models used in the social sciences and psychometrics to link unobservable characteristics to observed outcomes. Examples of latent traits: verbal or mathematical ability, racial prejudice or stress proneness, political inclinations. Each is an intrinsic "quality" that cannot be measured directly.

IRT in education: fitting an IRT model finds the discrimination and difficulty of test questions and the ability of the test participants. In education, questions that can discriminate between students of different ability are preferred to merely "very difficult" questions.

How it works: a students × questions score matrix is fed to an IRT model, which returns the discrimination and difficulty of the questions and the ability (latent trait) of the students.

          Q1    Q2    Q3    Q4
  Stu 1   0.95  0.87  0.67  0.84
  Stu 2   0.57  0.49  0.78  0.77
  ...
  Stu n   0.75  0.86  0.57  0.45
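To make the input format concrete, here is a minimal R sketch of the score matrix shown above; the object, row and column names are illustrative only and do not come from the tutorial code.

```r
# Score matrix from the slide: one row per student, one column per question,
# scores in [0, 1]. In the AIRT setting the rows become datasets and the
# columns become algorithms.
scores <- data.frame(
  Q1 = c(0.95, 0.57, 0.75),
  Q2 = c(0.87, 0.49, 0.86),
  Q3 = c(0.67, 0.78, 0.57),
  Q4 = c(0.84, 0.77, 0.45),
  row.names = c("Stu1", "Stu2", "StuN")
)
scores
```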

What does IRT give us? For each question Qj, a discrimination parameter and a difficulty parameter; for each student i, an ability estimate, all from the score matrix shown above.

The causal understanding: the discrimination and difficulty of question j, together with the ability of student i, determine the marks of student i for question j. Student and question characteristics cause the observed marks.

Dichotomous IRT (multiple choice, true or false): the probability that examinee $i$ answers item $j$ correctly is $P(x_{ij} = 1 \mid \theta_i) = \gamma_j + \frac{1 - \gamma_j}{1 + e^{-\alpha_j(\theta_i - \beta_j)}}$, where $x_{ij}$ is the outcome/score of examinee $i$ on item $j$, $\theta_i$ is the examinee's ability, $\gamma_j$ is the guessing parameter for item $j$, $\beta_j$ is the difficulty parameter and $\alpha_j$ is the discrimination.
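To make the roles of these parameters concrete, here is a short base-R sketch (not from the tutorial materials) of the three-parameter logistic item characteristic curve; the parameter values are arbitrary illustrations.

```r
# Three-parameter logistic (3PL) model:
# P(correct | theta) = guess + (1 - guess) * logistic(discr * (theta - diff))
icc <- function(theta, discr, diff, guess = 0) {
  guess + (1 - guess) * plogis(discr * (theta - diff))
}

theta <- seq(-4, 4, by = 0.1)  # ability range

# An easy, highly discriminating item vs. a harder, weakly discriminating
# item with a guessing floor of 0.25.
plot(theta, icc(theta, discr = 2.0, diff = -1.0), type = "l",
     xlab = "Ability (theta)", ylab = "P(correct)", ylim = c(0, 1))
lines(theta, icc(theta, discr = 0.5, diff = 1.5, guess = 0.25), lty = 2)
legend("topleft", lty = 1:2,
       legend = c("discr = 2, diff = -1", "discr = 0.5, diff = 1.5, guess = 0.25"))
```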

Continuous IRT: used when the response is continuous, such as grades out of 100, rather than binary; the model describes a 2D surface of probabilities over ability and score.

Mapping algorithm evaluation to IRT: in the standard IRT model the participants are students and the items are test questions; in the new mapping, datasets/problems play the role of students and algorithms play the role of test questions.

The new mapping, AIRT: the discrimination and difficulty of algorithm j, together with the ability of dataset i, determine the performance (the "marks") of dataset i for algorithm j.

Fitting the IRT model: the parameters are estimated by maximising the expectation (an EM-style procedure). The quantities involved are the discrimination parameter of each algorithm, a scaling parameter for each algorithm, the difficulty parameter of each algorithm, the observed score of the algorithm on each dataset/problem, and prior probabilities over the latent trait.

What happens to the IRT parameters? In IRT, θ is the ability of a student: as θ increases, the probability of a higher score increases. What is θ in terms of a dataset? It measures the easiness of the dataset, and reversing it gives a dataset difficulty score.

Discrimination parameter: as the discrimination α of an item increases, the slope of its characteristic curve gets steeper. What is α in terms of an algorithm? A large |α| reflects a lack of stability/robustness of the algorithm, so its inverse measures the consistency of the algorithm.

Consistent algorithms: in education, an item whose curve is almost flat (near-zero discrimination) gives no information about the students; for algorithms, such flatness means the algorithm is very stable, or consistent. Consistency = 1/|α|.

Anomalous algorithms: algorithms that perform poorly on easy datasets and well on difficult datasets have negative discrimination. In education such items are discarded or revised, but an anomalous algorithm is interesting. Anomalousness: an algorithm is flagged as anomalous when its discrimination is negative (α < 0).

Meaning of the AIRT parameters: discrimination → anomalousness and algorithm consistency; difficulty → algorithm difficulty limit; student ability → dataset difficulty spectrum.
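One way to state this mapping compactly, as a sketch based on the slide definitions above (the sign conventions and exact scaling used in the JMLR paper may differ):

```latex
% AIRT reinterpretation of the fitted IRT parameters
% (alpha_j: discrimination, delta_j: difficulty of algorithm j;
%  theta_i: latent ability of dataset i)
\begin{align*}
  \text{anomalous}_j          &\iff \alpha_j < 0 \\
  \text{consistency}_j        &= 1/|\alpha_j| \\
  \text{difficulty limit}_j   &= \delta_j \\
  \text{dataset difficulty}_i &= -\theta_i
\end{align*}
```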

What can we say? Consistent algorithms give similar performance on easy and hard datasets; algorithms with higher difficulty limits can handle harder problems; anomalous algorithms give bad performance on easy problems and good performance on difficult problems.

Dataset difficulty is a function of the discrimination and difficulty parameters and of the observed scores.

More on dataset difficulty: order the datasets by difficulty and see how the actual performance behaves across the difficulty values. (Source: Fenella McAndrew, "Adiabatic Quantum Computing to solve the MaxCut graph problem", Master of Science (Mathematics and Statistics) dissertation, The University of Melbourne, 2020, supervised by Kate Smith-Miles and Charles Hill.)

Strengths and weaknesses: draw splines of performance against dataset difficulty; where an algorithm's curve is on top it is strong, and where its curve is at the bottom it is weak.

Fitting AIRT models: Shiny App at https://sevvandi.shinyapps.io/AIRT/ (developed by Brodie Oldfield at CSIRO's Data61) and the R package airt (on CRAN), https://sevvandi.github.io/airt/.

Example – fitting an AIRT model to anomaly detection algorithms: 8 anomaly detection algorithms evaluated on 12338 datasets.
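For readers who want to reproduce this kind of analysis, a minimal R sketch of the workflow with the airt package follows. It assumes a data frame `performance` with one row per dataset and one column per algorithm, and that `cirtmodel()` is the continuous-model fitting function; argument names and the structure of the returned object should be checked against the package documentation at https://sevvandi.github.io/airt/.

```r
# Sketch only: fit a continuous IRT (AIRT) model to a performance matrix.
# `performance` holds one row per dataset and one column per algorithm,
# with scores rescaled to a common range.
library(airt)

mod <- cirtmodel(performance)  # fit the continuous IRT model
str(mod)                       # inspect the fitted discrimination/difficulty
                               # parameters and derived AIRT metrics
```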

Anomaly detection algorithms:

  Algorithm          Consistency   Difficulty limit   Anomalous
  Ensemble               55.08         -66.55           FALSE
  LOF                     4.50           5.10           FALSE
  KNN                     1.72           2.30           FALSE
  FAST_ABOD               9.39          10.23           FALSE
  Isolation Forest        2.35           3.05           FALSE
  KDEOS_1                 0.86          -0.31           TRUE
  KDEOS_2                 1.16          -0.51           TRUE
  LDF                     2.01           2.08           FALSE

Dataset difficulty spectrum: the IRT student ability axis becomes the dataset difficulty axis.

Focusing on algorithm performance: with algorithm performance on the y axis and problem difficulty on the x axis, we can fit a model to see how each algorithm performs across the difficulty spectrum. We use smoothing splines: they are easy to visualise and require no parameters to specify; a sketch follows.
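A minimal base-R sketch of this step, assuming a vector `difficulty` of dataset difficulty scores and a matrix `performance` of algorithm scores (one column per algorithm); the object names are illustrative, not from the tutorial code.

```r
# Fit one smoothing spline per algorithm: performance as a function of
# dataset difficulty. smooth.spline() picks its smoothness by
# cross-validation, so there is no tuning parameter to specify.
grid <- seq(min(difficulty), max(difficulty), length.out = 200)

fits <- lapply(colnames(performance), function(algo) {
  sp <- smooth.spline(difficulty, performance[, algo])
  predict(sp, grid)$y
})
names(fits) <- colnames(performance)

# Where an algorithm's curve sits on top it is strong in that difficulty
# region; where it sits at the bottom it is weak.
matplot(grid, do.call(cbind, fits), type = "l", lty = 1,
        col = seq_along(fits),
        xlab = "Dataset difficulty", ylab = "Predicted performance")
legend("topright", legend = names(fits), col = seq_along(fits), lty = 1)
```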

Strengths and weaknesses: if an algorithm's curve is on top, it is strong in that region; if the curve is at the bottom, it is weak in that region.

AIRT vs IRT: where IRT gives question and student characteristics, AIRT gives algorithm and dataset characteristics, lets us visualise strengths and weaknesses, and supports portfolio construction.

Limitations and extensions of AIRT: the method assumes the set of algorithms is diverse. In education, if all the questions are equally discriminative and equally difficult, IRT does not add much; it is useful when we have a diverse set of questions and want to know which questions are more discriminative and which are difficult. Possible extensions: extend the 1D problem difficulty spectrum to 2D, and extend the single performance metric to multiple metrics (performance and fairness?). Happy to collaborate.

Summary: AIRT lets us understand more about algorithms (anomalousness, consistency, difficulty limit) and visualise their strengths and weaknesses. Shiny App at https://sevvandi.shinyapps.io/AIRT/; R package airt (on CRAN), https://sevvandi.github.io/airt/; paper: "Comprehensive Algorithm Portfolio Evaluation using Item Response Theory", https://jmlr.org/papers/v24/20-1318.html.

Time to play with the Shiny App: https://sevvandi.shinyapps.io/AIRT/

New AIRT parameters, defined in terms of M, V and C, which denote mean, variance, and covariance terms.

New AIRT parameters (continued).

Algorithm portfolio selection: algorithm strengths can be used to select a good portfolio of algorithms, which we call the airt portfolio. airt stands for Algorithm IRT and is also an old Scottish word meaning "to guide". A sketch of this selection idea follows.
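Continuing the spline sketch above, one simple way to turn the strengths into a portfolio (an illustration of the idea only, not the selection rule from the paper) is to keep every algorithm that is the top predicted performer somewhere on the difficulty grid.

```r
# `fits` and `grid` come from the earlier smoothing-spline sketch.
# An algorithm earns a place in the portfolio if it is the best predicted
# performer at one or more points of the difficulty spectrum.
pred <- do.call(cbind, fits)           # grid points x algorithms
best <- colnames(pred)[max.col(pred)]  # top algorithm at each grid point
airt_portfolio <- unique(best)
airt_portfolio
```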

AIRT framework: performance data → IRT model → algorithm and dataset metrics → fitting smoothing splines → algorithm strengths and weaknesses → airt portfolio → evaluate portfolios. (Diagram legend: general blocks vs. AIRT outputs.)

What happens to the IRT parameters? In IRT, θ is the ability of a student: as θ increases, the probability of a higher score increases. What is θ in terms of a dataset? It measures the easiness of the dataset, and reversing it gives a dataset difficulty score.

Discrimination parameter: as the discrimination α of an item increases, the slope of its characteristic curve gets steeper. What is α in terms of an algorithm? A large |α| reflects a lack of stability/robustness of the algorithm, so its inverse measures the consistency of the algorithm.

Consistent algorithms: in education, an item whose curve is almost flat (near-zero discrimination) gives no information about the students; for algorithms, such flatness means the algorithm is very stable, or consistent. Consistency = 1/|α|.

Anomalous algorithms: algorithms that perform poorly on easy datasets and well on difficult datasets have negative discrimination. In education such items are discarded or revised, but an anomalous algorithm is interesting. Anomalousness: an algorithm is flagged as anomalous when its discrimination is negative (α < 0).