Explainable insights on algorithm performance

Explainable insights
on algorithm
performance
Sevvandi Kandanaarachchi
Work with Kate Smith-Miles
1

To explain or to predict? – Galit Shmueli
● Paper by Shmueli in 2010
● Talks about these two topics
● Argues that explanation and prediction are different
● Two modelling paths – for predicting and explaining
● Social sciences have done explanatory models for a long time

What is an explanation?
• To explain an event is to provide some information about its causal history. –
Lewis, 1986 (Causal Explanation)
• A statement or an account that makes something clear – Google
• It is important to note that the solution to explainable AI is not just ‘more AI’ -
Miller, 2019
• Miller (2019) argues for Social Science + Computer Science in XAI
In the fields of philosophy, cognitive psychology/science, and social
psychology, there is a vast and mature body of work that studies these exact
topics.
3

Message: Bring in the
social scientists to the
party!
Integrate their methods!
4

What is this talk about?
● Using a method in social sciences to do two ML type tasks
● Evaluate algorithms
● It gives us more meaningful metrics about algorithms
● Has some causal interpretations
● Visually inspect where algorithms perform well (and poorly)
● Ensembles
● Anomaly detection ensembles

What is algorithm evaluation?
• Performance of many algorithms to many problems
• How do you explain the algorithm performance?
• Standard statistical analysis misses many things
Algo 1 Algo 2 Algo 3 Algo 4
Problem 1
Problem 2
Problem 3
Problem 4
Problem 5
6

We want to evaluate algorithms
performance . . .
. . . in a way that we understand algorithms
and problems better!
7

Item Response Theory
(IRT)
• Modelsusedinsocialsciences/psychometrics
• Unobservablecharacteristicsandobserved
outcomes
• Verbalormathematicalability
• Racialprejudiceor stressproneness
• Politicalinclinations
• Intrinsic“quality”thatcannotbemeasured
directly
9
This Photo by Unknown Author is licensed under CC BY-SA

IRT in education
• Finds the discrimination and difficulty of test
questions
• And the ability of the test participants
• By fitting an IRT model
• In education – questions that can discriminate
between students with different ability is preferred
to “very difficult” questions.
10

How it works
11
Questions
Students Q 1 Q 2 Q 3 Q 4
Stu 1 0.95 0.87 0.67 0.84
Stu 2 0.57 0.49 0.78 0.77
Stu n 0.75 0.86 0.57 0.45
IRT Model
Discrimination of
questions
Difficulty of questions
Ability of students (latent trait)
Matrix
βj
αj
θi

What does IRT give us?
• Q1 - discrimination , difficulty
• Student 1 ability
︙
• Student n ability
α1 β1
α2 β2
α3 β3
α4 β4
θ1
θn
Q 1 Q 2 Q 3 Q 4
Stu 1 0.95 0.87 0.67 0.84
Stu 2 0.57 0.49 0.78 0.77
Stu n 0.75 0.86 0.57 0.45
12

The causal understanding
𝛼
𝑗
𝛽
𝑗
𝜃
𝑖
𝑥
𝑖
𝑗
Discrimination of Q j Difficulty of Q j Ability of student i
Marks of student i for Question j
Student
Question
Marks
13

Dichotomous IRT
● Multiple choice
● True or false
●
● - outcome/score of examinee for item
● - examinee’s ability
● - guessing parameter for item
● - difficulty parameter
● - discrimination
𝜙
(
𝑥
𝑖
𝑗
= 1
𝜃
𝑖
,
𝛼
𝑗
,
𝑑
𝑗
,
𝛾
𝑗
) =
𝛾
𝑗
+
1 −
𝛾
𝑗
1 + exp( −
𝛼
𝑗
(
𝜃
𝑖
−
𝑑
𝑗
))
𝑥
𝑖
𝑗
𝑖𝑗𝜃
𝑖
(
𝑖
)
𝛾
𝑗
𝑗𝑑
𝑗
𝛼
𝑗
This Photo by Unknown Author is licensed under CC BY-NC
14

Continuous IRT
• Grades out of 100
• A 2D surface of probabilities
f (zj |θ) =
αjγj
2π
exp
(
−
α2
j
2 (θ − βj − γjzj)
2
)
15

Mapping algorithm evaluation to IRT
IRT Model
Students
Test questions
16
Problems/datasets
Algorithms

The new mapping
𝛼
𝑗
𝛽
𝑗
𝜃
𝑖
𝑥
𝑖
𝑗
Discrimination of Algo j Difficulty of Algo j Ability of Dataset i
Marks of dataset i for algorithm j
Dataset
Algorithm
Performance
17

Fitting the IRT model
• Maximising the expectation
• - discrimination parameter of algorithm
• - scaling parameter for the algorithm
• - difficulty parameter for the algorithm
• - score of the algorithm on the dataset/problem
• - prior probabilities
Eθ|Λ(t),Z [ln p (Λ|θ, Z)] = N
n
∑
j=1
(ln αj + ln γj) −
1
2
N
∑
i=1
n
∑
j=1
α2
j ((βj + γjzij − μ(t)
i )
2
+ σ(t)2
)
+ ln p (Λ) + const
αj
γj
βj
zij
Λ

New meaning of IRT parameters?
● Discrimination -> anomalousness, algorithm consistency
● Difficulty -> algorithm difficulty limit
● Student ability -> Dataset difficulty spectrum

What can we say . . .
• Consistent algorithms give similar performance for easy or hard datasets
• Algorithms with higher difficulty limits can handle harder problems
• Anomalous algorithms give bad performance for easy problems and good
performance for difficult problems
20

Dataset difficulty
●
● Is a function of discrimination, difficulty and scores
dataset i di
ffi
culty = −
∑j
̂
α2
j (
̂
βj + ̂
γjzij)
∑j
̂
α2
j

Example – fitting IRT model
• Anomaly detection algorithms
• 8 anomaly detection algorithms
• 3142 datasets
• Our performance metric is AUROC (looking at the performance vs actual)
22

Anomaly detection
algorithms
23
Algorithm Consistency Difficulty Limit Anomalous
Ensemble 55.08 -66.55 FALSE
LOF 4.50 5.10 FALSE
KNN 1.72 2.30 FALSE
FAST_ABOD 9.39 10.23 FALSE
Isolation Forest 2.35 3.05 FALSE
KDEOS_1 0.86 -0.31 TRUE
KDEOS-2 1.16 -0.51 TRUE
LDF 2.01 2.08 FALSE

Dataset difficulty spectrum
IRT Student Ability -> Dataset Difficulty
24

Focusing on algorithm performance
• We have algorithm performance (y axis)
and problem difficulty (x axis)
• We can fit a model and find how each
algorithm performs
• We use smoothing splines
• Can visualize them
• No parameters to specify
25

Strengths and weaknesses of algorithms
• If the curve is on top – it is strong in that region
• If the curve is at bottom – weak in that region
27

AIRT
IRT
Question and
student
characteristics
Algorithm and
dataset
characteristics
Visualise strengths
and weaknesses
Portfolio
construction
28

AIRT performs well, when . . .
• The set of algorithms is diverse.
• Ties back to IRT basics
• IRT in education – If all the questions are equally discriminative and difficult,
IRT doesn’t add much
• IRT useful when we have a diverse set of questions and we want to know
• Whichquestionsaremorediscriminative
• Whichquestionsaredifficult
29

AIRT
• Understand moreaboutalgorithms
• Anomalousness,consistency,difficultylimit
• Visualisethestrengthsandweaknessesofalgorithms
• Selectaportfolioofgood algorithms
• Includesdiagnostics
• Rpackage airt(onCRAN)
• https://sevvandi.github.io/airt/
•
30

A different application of IRT
31

IRT to build an ensemble
● Previous work was using performance - marks
● What about using original responses?
● Like survey questions
● Rosenberg's Self-Esteem Scale - I feel I am a person of worth
(Strongly agree/Agree/Neutral/Disagree/Strongly disagree)
● No right or wrong answer
● Latent trait gives the person’s self esteem
● Latent trait uncovers the “hidden quality”

Unsupervised algorithms
● Instead of performance values what if you have original responses?
Q 1 Q 2 Q 3 Q 4
Stu 1 0.95 0.87 0.67 0.84
Stu 2 0.57 0.49 0.78 0.77
Stu n 0.75 0.86 0.57 0.45
latent value (i) = −
∑j
̂
α2
j (
̂
βj + ̂
γjzij)
∑j
̂
α2
j

What is an anomaly detection ensemble?
Dataset
Unsupervised
AD methods
• The AD methods are heterogenous methods
• Ensembles – use existing methods to come up with better anomaly detection/
scores
AD ensemble
Ensemble
Score

Anomaly detection ensembles
● Latent trait = anomalousness of the observations = the ensemble score
●
, a weighted score of original responses
●
ensemble score of observa
ti
on (i) = −
∑j
̂
α2
j (
̂
βj + ̂
γjzij)
∑j
̂
α2
j

IRT Ensemble
36
𝑌
𝑁
×
𝑛
IRT
ensemble
Ensemble
Score

IRT for anomaly detection ensembles
• R package – outlierensembles – on CRAN

IRT and algorithms
● Algorithm evaluation
- AIRT
● Anomaly detection
ensembles
● IRT used in ML more

New AIRT parameters
●
●
●
● Where M, V and C denote mean, variance, and covariance terms
γ(t+1)
j
=
V (μ(t)
i ) + σ(t)2
Cj (zij, μ(t)
i )
β(t+1)
j
= M (μ(t)
i ) − γ(t+1)
j
Mj (zij)
α(t+1)
j
=
(
γ(t+1)2
j
Vj(zij) − V (μ(t)
i ) − σ(t)2
)
−1/2

Explaining notation
●
●
●
●
●
Mj (zij) =
∑i
zij
N
M (μ(t)
i ) =
∑i
μ(t)
i
N
V (zij) =
∑i
z2
ij
N
− Mj (zij)
2
V (μ(t)
i ) =
∑i
μ(t)2
i
N
− M (μ(t)
i )
2
Cj (zij, μ(t)
i ) =
∑i
zijμ(t)
i
N
− Mj (zij) M (μ(t)
i )

More notation
●
● is the th iteration
●
●
p (θi |Λ(t)
, zi) =
𝒩
(θi |μ(t)
i
, σ(t)2
)
Λ(t)
= (λ1
(t)
, …, λn
(t)
) , λj
(t)
= (α(t)
j
, β(t)
j
, γ(t)
j )
T
t
σ(t)2
=
∑
j
α(t)2
j
+ σ−2
−1
μ(t)
i
= σ(t)2
∑
j
α(t)2
j (β(t)
j
+ γ(t)
j
zij) + μ

Algorithm portfolio selection
• Can use algorithm strengths to select a good portfolio of algorithms
• We call this portfolio airt portfolio
• airt – Algorithm IRT (old Scottish word – to guide)
43

AIRT framework
Performance
data
IRT Model
Algorithm and
dataset metrics
Fitting
smoothing
splines
Algorithm
strengths and
weaknesses
Airt portfolio
Evaluate
portfolios
General block
AIRT output
44

What happens to the IRT parameters?
• IRT - ability of student
• As increases probability of a higher
score increases
• What is in terms of a dataset?
• easiness of the dataset
• Dataset difficulty score
θi
θi
θi
−θi
45

Discrimination parameter
• Discrimination of item
• increases slope of curve increases
• What is in terms of an algorithm?
• - lack of stability/robustness of
algo
•
Consistency of algo
αj
αj
αj
αj
1
|αj |
46

Consistent algorithms
• Education – such a question doesn’t
give any information
• Algorithms – these algorithms are
really stable or consistent
•
Consistency =
1
|αj |
47

Anomalous algorithms
• Algorithms that perform poorly on
easy datasets and well on difficult
datasets
• Negative discrimination
• In education – such items are
discarded or revised
• If an algorithm anomalous, it is
interesting
• Anomalousness = = sign(αj)
This Photo by Unknown Author is licensed under CC BY-NC-ND
48

Explainable insights on algorithm performance

Recommended

Recommended

More Related Content

Similar to Explainable insights on algorithm performance

Similar to Explainable insights on algorithm performance (20)

More from CSIRO

More from CSIRO (11)

Recently uploaded

Recently uploaded (20)

Explainable insights on algorithm performance