How can we evaluate a portfolio of algorithms to extract meaningful interpretations about them? Suppose we have a set of algorithms: classification, regression, clustering, or any other type. And suppose we have a set of problems these algorithms can work on. We can evaluate the algorithms on the problems and record the results. From these results, can we explain the algorithms in a meaningful way? To answer this question we turn to the social sciences, whose methodologies focus on explanation as opposed to accurate prediction.
Item Response Theory (IRT) is a methodology in educational psychometrics used to design, analyse, and score test questions and questionnaires. IRT can measure hidden qualities such as stress proneness, political inclination, or verbal/mathematical ability. Participants take tests, and IRT is used to determine the ability of the participants and the discrimination and difficulty of the test questions. In this talk we present a novel mapping of the traditional IRT framework to the algorithm evaluation domain. Using this new mapping, we elicit a richer suite of characteristics, including stability and anomalousness, that describe important aspects of algorithm performance. We find the strengths and weaknesses of algorithms in the problem space, and use them to construct a smaller portfolio of algorithms that still gives good performance.
1. Using educational models to explain algorithm evaluation
Sevvandi Kandanaarachchi
Work with Kate Smith-Miles
2. What is an explanation?
• "To explain an event is to provide some information about its causal history." – Lewis, 1986 (Causal Explanation)
• "A statement or an account that makes something clear" – Google
• "It is important to note that the solution to explainable AI is not just 'more AI'" – Miller, 2019
3. Explanation in the social sciences
• Miller (2019) argues for Social Science + Computer Science in XAI:
"In the fields of philosophy, cognitive psychology/science, and social psychology, there is a vast and mature body of work that studies these exact topics. For millennia, philosophers have asked the questions about what constitutes an explanation, what is the function of explanations, and what are their structure. For over 50 years, cognitive and social psychologists have analysed how people attribute and evaluate the social behaviour of others in physical environments. For over two decades, cognitive psychologists and scientists have investigated how people generate explanations and how they evaluate their quality.
I argue here that there is considerable scope to infuse this valuable body of research into explainable AI."
5. What is algorithm evaluation?
• Performance of many algorithms on many problems
• How do you explain the algorithm performance?
• Standard statistical analysis misses many things
[Table: performance matrix with Problems 1–5 as rows and Algos 1–4 as columns]
6. We want to evaluate algorithm performance in a way that helps us understand algorithms and problems better!
8. Item Response Theory (IRT)
• Models used in the social sciences/psychometrics
• Unobservable characteristics and observed outcomes
  • Verbal or mathematical ability
  • Racial prejudice or stress proneness
  • Political inclinations
• Intrinsic "quality" that cannot be measured directly
9. IRT in education
• Finds the discrimination and difficulty of test questions
• And the ability of the test participants
• By fitting an IRT model
• In education, questions that can discriminate between students of different ability are preferred to "very difficult" questions
10. How it works
Input: an n × N score matrix (n students, N questions)

Students  Q 1   Q 2   Q 3   Q 4
Stu 1     0.95  0.87  0.67  0.84
Stu 2     0.57  0.49  0.78  0.77
Stu n     0.75  0.86  0.57  0.45

Fitting an IRT model to this matrix gives:
• Discrimination of questions α_j
• Difficulty of questions β_j
• Ability of students θ_i (latent trait)
11. What does IRT give us?
• Q1 – discrimination α_1, difficulty β_1
• Q2 – discrimination α_2, difficulty β_2
• Q3 – discrimination α_3, difficulty β_3
• Q4 – discrimination α_4, difficulty β_4
• Student 1 – ability θ_1
• ⋮
• Student n – ability θ_n
12. The causal understanding
[Diagram: α_j (discrimination of question j), β_j (difficulty of question j), and θ_i (ability of student i) jointly determine x_ij, the marks of student i for question j]
13. IRT in Data Science/Machine Learning
• Relatively new area of research
• Seminal paper: 2019 – "Item response theory in AI: Analysing machine learning classifiers at the instance level" – F. Martínez-Plumed et al.
14. Dichotomous IRT
• Multiple choice
• True or false
• P(x_ij = 1 | θ_i, α_j, d_j, γ_j) = γ_j + (1 − γ_j) / (1 + exp(−α_j(θ_i − d_j)))
• x_ij – outcome/score of examinee i for item j
• θ_i – examinee's (i) ability
• γ_j – guessing parameter for item j
• d_j – difficulty parameter
• α_j – discrimination
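The item characteristic curve on the slide above can be sketched directly. This is a minimal Python illustration of the formula (not the airt package's API); the parameter values below are made up for demonstration.

```python
import math

def p_correct(theta, alpha, d, gamma=0.0):
    """Probability of a correct answer under the dichotomous IRT model:
    P(x_ij = 1) = gamma_j + (1 - gamma_j) / (1 + exp(-alpha_j * (theta_i - d_j)))."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-alpha * (theta - d)))

# A highly discriminating item (alpha = 2) vs a flat one (alpha = 0.3),
# both with difficulty d = 0: the steep curve separates low- and
# high-ability examinees far more sharply.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct(theta, 2.0, 0.0), 3),
                 round(p_correct(theta, 0.3, 0.0), 3))
```

Note how γ acts as a lower asymptote: even an examinee of very low ability answers correctly with probability about γ (pure guessing).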
15. Continuous IRT
• Grades out of 100
• A 2D surface of probabilities
• f(x_ij | θ_i, d_j, α_j)
[Figure: probability surface f(x_ij | θ_i, d_j, α_j)]
16. Continuous IRT
• f(x_ij | θ_i, d_j, α_j)
• At a fixed θ
[Figure: f(x_ij | θ_i, d_j, α_j) at a fixed θ; x axis is the normalized score]
19. The new mapping
[Diagram: α_j (discrimination of algorithm j), β_j (difficulty of algorithm j), and θ_i (ability of dataset i) jointly determine x_ij, the performance of algorithm j on dataset i]
20. What happens to the IRT parameters?
• IRT: θ_i – ability of student i
• As θ increases, the probability of a higher score increases
• What is θ_i in terms of a dataset?
• θ_i – easiness of the dataset
• δ_i = −θ_i
• δ_i – dataset difficulty score
21. Discrimination parameter
• Discrimination of item j = α_j
• As α_j increases, the slope of the curve increases
• What is α_j in terms of an algorithm?
• α_j – lack of stability/robustness of the algorithm
• 1/|α_j| – consistency of the algorithm
22. Consistent algorithms
• Education – such a question doesn't give any information
• Algorithms – these algorithms are really stable or consistent
• Consistency = 1/|α_j|
23. Anomalous algorithms
• Algorithms that perform poorly on easy datasets and well on difficult datasets
• Negative discrimination
• In education, such items are discarded or revised
• If an algorithm is anomalous, it is interesting
• Anomalousness indicator = sign(α_j)
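The traits from the last three slides can be computed mechanically from a fitted discrimination. A minimal Python sketch (not the airt package's API); the algorithm names and α values are hypothetical:

```python
def algorithm_traits(alpha):
    """Map a fitted IRT discrimination alpha_j to the AIRT traits on these
    slides: consistency = 1/|alpha_j|, and an algorithm is flagged
    anomalous when its discrimination is negative."""
    return {"consistency": 1.0 / abs(alpha), "anomalous": alpha < 0}

# Hypothetical fitted discriminations for three algorithms:
fitted = {"algoA": 2.5, "algoB": 0.2, "algoC": -1.1}
traits = {name: algorithm_traits(a) for name, a in fitted.items()}
# algoB has the flattest curve, so it is the most consistent;
# algoC has negative discrimination, so it is flagged anomalous.
```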
25. After fitting the model, we get . . .
• Algorithm metrics: consistency, difficulty limit, anomalousness indicator
• Dataset metrics: difficulty scores
26. What can we say . . .
• Consistent algorithms give similar performance on easy and hard datasets
• Algorithms with higher difficulty limits can handle harder problems
• Anomalous algorithms give bad performance on easy problems and good performance on difficult problems
27. Example – fitting the IRT model
• Graph colouring algorithms (Smith-Miles et al., 2014)
• 8 graph colouring algorithms
  • RandomGreedy, DSATUR, Bktr, HillClimber, HEA, PartialCol, TabuCol, AntCol
• Performance: how many colours each algorithm used to colour 6712 graphs
31. Focusing on algorithm performance
• We have algorithm performance (y axis) and problem difficulty (x axis)
• We can fit a model and find how each algorithm performs
• We use smoothing splines
• Can visualize them
• No parameters to specify
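The idea of turning scattered (difficulty, performance) points into a per-algorithm curve can be sketched without any spline machinery. Below is a crude moving-average smoother in Python standing in for the smoothing splines the slide mentions; it is illustrative only, and the window size is an arbitrary assumption:

```python
def smooth_performance(difficulty, performance, window=5):
    """Sort datasets by difficulty, then average performance over a sliding
    window to approximate a performance curve h_a(delta) for one algorithm.
    (Stand-in for the smoothing splines used in the talk.)"""
    pairs = sorted(zip(difficulty, performance))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    half = window // 2
    smoothed = []
    for i in range(len(ys)):
        lo, hi = max(0, i - half), min(len(ys), i + half + 1)
        smoothed.append(sum(ys[lo:hi]) / (hi - lo))
    return xs, smoothed
```

Evaluating this for every algorithm on a common difficulty grid gives the family of curves that the next slides compare.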
33. Strengths and weaknesses of algorithms
• If the curve is on top, the algorithm is strong in that region
• If the curve is at the bottom, it is weak in that region
• h*(δ) = max_a h_a(δ) – the best performance for a given difficulty
• a* is the best algorithm for that difficulty
• strengths(a, ε) = { δ : h*(δ) − h_a(δ) ≤ ε }
• We give ε leeway
• Weaknesses are defined similarly
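The strengths definition above is easy to evaluate once each curve h_a has been sampled on a common difficulty grid. A minimal Python sketch (the grid, curves, and ε below are hypothetical):

```python
def strengths(curves, grid, eps=0.02):
    """curves: {algo: list of h_a values on the difficulty grid}.
    An algorithm is strong at difficulty delta when its curve is within
    eps of the best curve h*(delta) = max_a h_a(delta)."""
    best = [max(c[i] for c in curves.values()) for i in range(len(grid))]
    return {a: {grid[i] for i in range(len(grid)) if best[i] - c[i] <= eps}
            for a, c in curves.items()}

grid = [0.0, 0.5, 1.0]
curves = {"algoA": [0.9, 0.8, 0.3],   # strong on easy problems
          "algoB": [0.5, 0.79, 0.7]}  # strong on hard problems
s = strengths(curves, grid, eps=0.02)
# algoA's strength region covers the easy end, algoB's the hard end,
# and both are within eps of the best near delta = 0.5.
```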
36. AIRT performs well when . . .
• The set of algorithms is diverse
• Ties back to IRT basics
• IRT in education – if all the questions are equally discriminative and difficult, IRT doesn't add much
• IRT is useful when we have a diverse set of questions and we want to know
  • which questions are more discriminative
  • which questions are difficult
37. Summary
• Understand more about algorithms: anomalousness, consistency, difficulty limit
• Visualise the strengths and weaknesses of algorithms
• Select a portfolio of good algorithms
• Includes diagnostics
• R package airt (on CRAN): https://sevvandi.github.io/airt/
• Pre-print: http://bit.ly/algorithmirt – "Comprehensive Algorithm Portfolio Evaluation using Item Response Theory" – more applications included
39. Algorithm portfolio selection
• Can use algorithm strengths to select a good portfolio of algorithms
• We call this the airt portfolio
• airt – Algorithm IRT (also an old Scottish word meaning "to guide")
• airt.portfolio(ε) = set of algorithms with strengths(ε)
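Given per-algorithm strength regions, the selection rule above reduces to keeping every algorithm whose region is non-empty. A minimal Python sketch, with hypothetical strength regions (sets of difficulty values where each algorithm is within ε of the best):

```python
def airt_portfolio(strength_regions):
    """Select the airt portfolio: every algorithm that is within epsilon of
    the best somewhere in difficulty space, i.e. has a non-empty strength
    region."""
    return {a for a, region in strength_regions.items() if region}

# Hypothetical strength regions from an earlier strengths analysis:
regions = {"algoA": {0.0, 0.5}, "algoB": {0.5, 1.0}, "algoC": set()}
portfolio = airt_portfolio(regions)
# algoC is never close to the best at any difficulty, so it is dropped.
```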
40. Evaluating this portfolio
• Let p denote a problem, P a portfolio of algorithms, F the full set of algorithms
• performance.deterioration(p, P) = best.performance(p, F) − best.performance(p, P)
• Graph colouring example
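The deterioration measure on this slide compares the best achievable performance before and after shrinking the portfolio. A minimal Python sketch, assuming higher performance values are better; the problem, algorithms, and scores are hypothetical:

```python
def best_performance(problem, algos, perf):
    """perf[(problem, algo)] holds a higher-is-better performance value."""
    return max(perf[(problem, a)] for a in algos)

def deterioration(problem, portfolio, full_set, perf):
    """performance.deterioration(p, P) =
    best.performance(p, F) - best.performance(p, P)."""
    return (best_performance(problem, full_set, perf)
            - best_performance(problem, portfolio, perf))

perf = {("p1", "algoA"): 0.9, ("p1", "algoB"): 0.8, ("p1", "algoC"): 0.95}
full = ["algoA", "algoB", "algoC"]
# Dropping algoC from the portfolio costs 0.05 on problem p1:
loss = deterioration("p1", ["algoA", "algoB"], full, perf)
```

A deterioration near zero across all problems indicates the smaller portfolio loses almost nothing relative to the full set.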
43. SAT11-individual example
• An example where airt doesn't do so well
• The curves are quite similar to each other
• Reason to believe SAT11 has pre-selected algorithms