GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory
SevvandiKandanaarach
70 views
42 slides
Aug 23, 2024
Slide 1 of 42
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
About This Presentation
GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory
Size: 16.25 MB
Language: en
Added: Aug 23, 2024
Slides: 42 pages
Slide Content
Algorithm Evaluation using Item Response Theory Sevvandi Kandanaarachchi Work with Kate Smith-Miles GECCO 2024 1 Kandanaarachchi, Sevvandi, and Kate Smith-Miles. "Comprehensive algorithm portfolio evaluation using item response theory." Journal of Machine Learning Research 24.177 (2023): 1-52.
To explain or to predict? – Galit Shmueli Paper by Shmueli in 2010 Talks about these two topics Argues that explanation and prediction are different Two modelling paths – for predicting and explaining Social sciences have done explanatory models for a long time
What is an explanation? “To explain an event is to provide some information about its causal history.” – Lewis, 1986 (Causal Explanation) “A statement or an account that makes something clear” – Google “It is important to note that the solution to explainable AI is not just ‘more AI’ ” Miller, 2019 Miller (2019) argues for Social Science + Computer Science in XAI “ In the fields of philosophy, cognitive psychology/science, and social psychology, there is a vast and mature body of work that studies these exact topics.” 3
Message: Bring in the social scientists to the party! Integrate their methods! 4
5 Hello Item Response Theory
Item Response Theory (IRT) Models used in social sciences/psychometrics Unobservable characteristics and observed outcomes Verbal or mathematical ability Racial prejudice or stress proneness Political inclinations Intrinsic “quality” that cannot be measured directly 6 This Photo by Unknown Author is licensed under CC BY-SA
IRT in education Finds the discrimination and difficulty of test questions And the ability of the test participants By fitting an IRT model In education – questions that can discriminate between students with different ability is preferred to “very difficult” questions. 7
How it works 8 Questions Students Q 1 Q 2 Q 3 Q 4 Stu 1 0.95 0.87 0.67 0.84 Stu 2 0.57 0.49 0.78 0.77 Stu n 0.75 0.86 0.57 0.45 IRT Model Discrimination of questions Difficulty of questions Ability of students (latent trait) Matrix
What does IRT give us? Q1 - discrimination , difficulty Q2 - discrimination , difficulty Q3 - discrimination , difficulty Q4 - discrimination , difficulty Student 1 ability : Student n ability Q 1 Q 2 Q 3 Q 4 Stu 1 0.95 0.87 0.67 0.84 Stu 2 0.57 0.49 0.78 0.77 Stu n 0.75 0.86 0.57 0.45 9
The causal understanding Discrimination of Q j Difficulty of Q j Ability of student i Marks of student i for Question j Student Question Marks 10
Dichotomous IRT Multiple choice True or false - outcome/score of examinee for item - examinee’s ability - guessing parameter for item - difficulty parameter - discrimination This Photo by Unknown Author is licensed under CC BY-NC 11
Continuous IRT Grades out of 100 A 2D surface of probabilities 12
Mapping algorithm evaluation to IRT IRT Model Students Test questions 13 Problems/datasets Algorithms IRT Model Algorithms Datasets/Problems
The new mapping - AIRT Discrimination of Algo j Difficulty of Algo j Ability of Dataset i Marks of dataset i for algorithm j Dataset Algorithm Performance 14
Fitting the IRT model Maximising the expectation - discrimination parameter of algorithm - scaling parameter for the algorithm - difficulty parameter for the algorithm - score of the algorithm on the dataset/problem - prior probabilities
What happens to the IRT parameters? IRT - ability of student As increases probability of a higher score increases What is in terms of a dataset? easiness of the dataset Dataset difficulty score 16
Discrimination parameter Discrimination of item increases slope of curve increases What is in terms of an algorithm? - lack of stability/robustness of algo Consistency of algo 17
Consistent algorithms Education – such a question doesn’t give any information Algorithms – these algorithms are really stable or consistent Consistency = 18
Anomalous algorithms Algorithms that perform poorly on easy datasets and well on difficult datasets Negative discrimination In education – such items are discarded or revised If an algorithm anomalous, it is interesting Anomalousness This Photo by Unknown Author is licensed under CC BY-NC-ND 19
Meaning of AIRT parameters? Discrimination -> anomalousness, algorithm consistency Difficulty -> algorithm difficulty limit Student ability -> Dataset difficulty spectrum
What can we say . . . Consistent algorithms give similar performance for easy or hard datasets Algorithms with higher difficulty limits can handle harder problems Anomalous algorithms give bad performance for easy problems and good performance for difficult problems 21
Dataset difficulty Dataset difficulty Is a function of discrimination, difficulty and scores
More on dataset difficulty Order by dataset difficulty See how the actual performance is for given difficulty values Fenella McAndrew, "Adiabatic Quantum Computing to solve the MaxCut graph problem", Master of Science (Mathematics and Statistics) dissertation, The University of Melbourne, 2020 (supervised by Kate Smith-Miles and Charles Hill).
Strengths and weaknesses Draw splines If the curve is on top it is strong If the curve is at the bottom, it is weak
Fitting AIRT models Shiny App https://sevvandi.shinyapps.io/AIRT/ Shiny App developed by Brodie Oldfield at CSIRO’s Data61 R package airt (on CRAN) https://sevvandi.github.io/airt/ Caption
Example – fitting AIRT model Anomaly detection algorithms 8 anomaly detection algorithms 12338 datasets 26
Focusing on algorithm performance We have algorithm performance (y axis) and problem difficulty (x axis) We can fit a model and find how each algorithm performs We use smoothing splines Can visualize them No parameters to specify 29
30 Strengths and weaknesses If the curve is on top – it is strong in that region If the curve is at bottom – weak in that region
AIRT IRT Question and student characteristics Algorithm and dataset characteristics Visualise strengths and weaknesses Portfolio construction 31
Limitations and extensions of AIRT . . . The set of algorithms is diverse. IRT in education – If all the questions are equally discriminative and difficult, IRT doesn’t add much IRT useful when we have a diverse set of questions and we want to know Which questions are more discriminative Which questions are difficult 1D problem difficulty spectrum – extend it to 2D? One performance metric – add multiple metrics Performance, fairness? Happy to collaborate. 32
Summary Understand more about algorithms Anomalousness, consistency, difficulty limit Visualise the strengths and weaknesses of algorithms Shiny App at https://sevvandi.shinyapps.io/AIRT/ R package airt (on CRAN) https://sevvandi.github.io/airt/ Paper: https://jmlr.org/papers/v24/20-1318.html Comprehensive Algorithm Portfolio Evaluation using Item Response Theory 33
34 Time to play with the Shiny App https://sevvandi.shinyapps.io/AIRT/
New AIRT parameters Where M, V and C denote mean, variance, and covariance terms
New AIRT parameters
Algorithm portfolio selection Can use algorithm strengths to select a good portfolio of algorithms We call this portfolio airt portfolio airt – Algorithm IRT (old Scottish word – to guide) 37
AIRT framework Performance data IRT Model Algorithm and dataset metrics Fitting smoothing splines Algorithm strengths and weaknesses Airt portfolio Evaluate portfolios General block AIRT output 38
What happens to the IRT parameters? IRT - ability of student As increases probability of a higher score increases What is in terms of a dataset? easiness of the dataset Dataset difficulty score 39
Discrimination parameter Discrimination of item increases slope of curve increases What is in terms of an algorithm? - lack of stability/robustness of algo Consistency of algo 40
Consistent algorithms Education – such a question doesn’t give any information Algorithms – these algorithms are really stable or consistent Consistency = 41
Anomalous algorithms Algorithms that perform poorly on easy datasets and well on difficult datasets Negative discrimination In education – such items are discarded or revised If an algorithm anomalous, it is interesting Anomalousness = This Photo by Unknown Author is licensed under CC BY-NC-ND 42