GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory

SevvandiKandanaarach 70 views 42 slides Aug 23, 2024

Slide 1 of 42

About This Presentation

Size: 16.25 MB

Language: en

Added: Aug 23, 2024

Slides: 42 pages

Slide Content

Algorithm Evaluation using Item Response Theory Sevvandi Kandanaarachchi Work with Kate Smith-Miles GECCO 2024 1 Kandanaarachchi, Sevvandi, and Kate Smith-Miles. "Comprehensive algorithm portfolio evaluation using item response theory." Journal of Machine Learning Research 24.177 (2023): 1-52.

To explain or to predict? – Galit Shmueli Paper by Shmueli in 2010 Talks about these two topics Argues that explanation and prediction are different Two modelling paths – for predicting and explaining Social sciences have done explanatory models for a long time

What is an explanation? “To explain an event is to provide some information about its causal history.” – Lewis, 1986 (Causal Explanation) “A statement or an account that makes something clear” – Google “It is important to note that the solution to explainable AI is not just ‘more AI’ ” Miller, 2019 Miller (2019) argues for Social Science + Computer Science in XAI “ In the fields of philosophy, cognitive psychology/science, and social psychology, there is a vast and mature body of work that studies these exact topics.” 3

Message: Bring in the social scientists to the party! Integrate their methods! 4

5 Hello Item Response Theory

Item Response Theory (IRT) Models used in social sciences/psychometrics Unobservable characteristics and observed outcomes Verbal or mathematical ability Racial prejudice or stress proneness Political inclinations Intrinsic “quality” that cannot be measured directly 6 This Photo by Unknown Author is licensed under CC BY-SA

IRT in education Finds the discrimination and difficulty of test questions And the ability of the test participants By fitting an IRT model In education – questions that can discriminate between students with different ability is preferred to “very difficult” questions. 7

How it works 8 Questions Students Q 1 Q 2 Q 3 Q 4 Stu 1 0.95 0.87 0.67 0.84 Stu 2 0.57 0.49 0.78 0.77 Stu n 0.75 0.86 0.57 0.45 IRT Model Discrimination of questions Difficulty of questions Ability of students (latent trait) Matrix

What does IRT give us? Q1 - discrimination , difficulty Q2 - discrimination , difficulty Q3 - discrimination , difficulty Q4 - discrimination , difficulty Student 1 ability : Student n ability Q 1 Q 2 Q 3 Q 4 Stu 1 0.95 0.87 0.67 0.84 Stu 2 0.57 0.49 0.78 0.77 Stu n 0.75 0.86 0.57 0.45 9

The causal understanding Discrimination of Q j Difficulty of Q j Ability of student i Marks of student i for Question j Student Question Marks 10

Dichotomous IRT Multiple choice True or false - outcome/score of examinee for item - examinee’s ability - guessing parameter for item - difficulty parameter - discrimination This Photo by Unknown Author is licensed under CC BY-NC 11

Continuous IRT Grades out of 100 A 2D surface of probabilities 12

Mapping algorithm evaluation to IRT IRT Model Students Test questions 13 Problems/datasets Algorithms IRT Model Algorithms Datasets/Problems

The new mapping - AIRT Discrimination of Algo j Difficulty of Algo j Ability of Dataset i Marks of dataset i for algorithm j Dataset Algorithm Performance 14

Fitting the IRT model Maximising the expectation - discrimination parameter of algorithm - scaling parameter for the algorithm - difficulty parameter for the algorithm - score of the algorithm on the dataset/problem - prior probabilities

What happens to the IRT parameters? IRT - ability of student As increases probability of a higher score increases What is in terms of a dataset? easiness of the dataset Dataset difficulty score 16

Discrimination parameter Discrimination of item increases slope of curve increases What is in terms of an algorithm? - lack of stability/robustness of algo Consistency of algo 17

Consistent algorithms Education – such a question doesn’t give any information Algorithms – these algorithms are really stable or consistent Consistency = 18

Anomalous algorithms Algorithms that perform poorly on easy datasets and well on difficult datasets Negative discrimination In education – such items are discarded or revised If an algorithm anomalous, it is interesting Anomalousness This Photo by Unknown Author is licensed under CC BY-NC-ND 19

Meaning of AIRT parameters? Discrimination -> anomalousness, algorithm consistency Difficulty -> algorithm difficulty limit Student ability -> Dataset difficulty spectrum

What can we say . . . Consistent algorithms give similar performance for easy or hard datasets Algorithms with higher difficulty limits can handle harder problems Anomalous algorithms give bad performance for easy problems and good performance for difficult problems 21

Dataset difficulty Dataset difficulty Is a function of discrimination, difficulty and scores

More on dataset difficulty Order by dataset difficulty See how the actual performance is for given difficulty values Fenella McAndrew, "Adiabatic Quantum Computing to solve the MaxCut graph problem", Master of Science (Mathematics and Statistics) dissertation, The University of Melbourne, 2020 (supervised by Kate Smith-Miles and Charles Hill).

Strengths and weaknesses Draw splines If the curve is on top it is strong If the curve is at the bottom, it is weak

Fitting AIRT models Shiny App https://sevvandi.shinyapps.io/AIRT/ Shiny App developed by Brodie Oldfield at CSIRO’s Data61 R package airt (on CRAN) https://sevvandi.github.io/airt/ Caption

Example – fitting AIRT model Anomaly detection algorithms 8 anomaly detection algorithms 12338 datasets 26

Anomaly detection algorithms 27 Algorithm Consistency Difficulty Limit Anomalous Ensemble 55.08 -66.55 FALSE LOF 4.50 5.10 FALSE KNN 1.72 2.30 FALSE FAST_ABOD 9.39 10.23 FALSE Isolation Forest 2.35 3.05 FALSE KDEOS_1 0.86 -0.31 TRUE KDEOS-2 1.16 -0.51 TRUE LDF 2.01 2.08 FALSE

Dataset difficulty spectrum IRT Student Ability - > Dataset Difficulty 28

Focusing on algorithm performance We have algorithm performance (y axis) and problem difficulty (x axis) We can fit a model and find how each algorithm performs We use smoothing splines Can visualize them No parameters to specify 29

30 Strengths and weaknesses If the curve is on top – it is strong in that region If the curve is at bottom – weak in that region

AIRT IRT Question and student characteristics Algorithm and dataset characteristics Visualise strengths and weaknesses Portfolio construction 31

Limitations and extensions of AIRT . . . The set of algorithms is diverse. IRT in education – If all the questions are equally discriminative and difficult, IRT doesn’t add much IRT useful when we have a diverse set of questions and we want to know Which questions are more discriminative Which questions are difficult 1D problem difficulty spectrum – extend it to 2D? One performance metric – add multiple metrics Performance, fairness? Happy to collaborate.  32

Summary Understand more about algorithms Anomalousness, consistency, difficulty limit Visualise the strengths and weaknesses of algorithms Shiny App at https://sevvandi.shinyapps.io/AIRT/ R package airt (on CRAN) https://sevvandi.github.io/airt/ Paper: https://jmlr.org/papers/v24/20-1318.html Comprehensive Algorithm Portfolio Evaluation using Item Response Theory 33

34 Time to play with the Shiny App https://sevvandi.shinyapps.io/AIRT/

New AIRT parameters Where M, V and C denote mean, variance, and covariance terms

New AIRT parameters

Algorithm portfolio selection Can use algorithm strengths to select a good portfolio of algorithms We call this portfolio airt portfolio airt – Algorithm IRT (old Scottish word – to guide) 37

AIRT framework Performance data IRT Model Algorithm and dataset metrics Fitting smoothing splines Algorithm strengths and weaknesses Airt portfolio Evaluate portfolios General block AIRT output 38

What happens to the IRT parameters? IRT - ability of student As increases probability of a higher score increases What is in terms of a dataset? easiness of the dataset Dataset difficulty score 39

Discrimination parameter Discrimination of item increases slope of curve increases What is in terms of an algorithm? - lack of stability/robustness of algo Consistency of algo 40

Consistent algorithms Education – such a question doesn’t give any information Algorithms – these algorithms are really stable or consistent Consistency = 41

Anomalous algorithms Algorithms that perform poorly on easy datasets and well on difficult datasets Negative discrimination In education – such items are discarded or revised If an algorithm anomalous, it is interesting Anomalousness = This Photo by Unknown Author is licensed under CC BY-NC-ND 42

GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

GECCO 2024 Tutorial on Algorithm Evaluation using Item Response Theory

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Slide 42

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

8-top-ai-courses-for-customer-support-representatives-in-2025.pptx

7-essential-ai-courses-for-call-center-supervisors-in-2025.pptx

25-essential-ai-courses-for-user-support-specialists-in-2025.pptx

8-essential-ai-courses-for-insurance-customer-service-representatives-in-2025.pptx

Know for Certain

PPT OPD LES 3ertt4t4tqqqe23e3e3rq2qq232.pptx