Invited talk at the ExUM workshop at the UMAP 2022 conference
Abstract:
Explainability has become an important topic both in Data Science and AI in general and in recommender systems in particular, as algorithms have become much less inherently explainable. However, explainability has different interpretations and goals in different fields. For example, interpretability and explainability tools in machine learning are predominantly developed for Data Scientists to understand and scrutinize their models. Current tools are therefore often quite technical and not very ‘user-friendly’. I will illustrate this with our recent work on improving the explainability of model-agnostic tools such as LIME and SHAP. Another stream of research on explainability in the HCI and XAI fields focuses more on users’ needs for explainability, such as contrastive and selective explanations and explanations that fit the mental models and beliefs of the user. However, how to satisfy those needs is still an open question. Based on recent work in interactive AI and machine learning, I will propose that explainability goes together with interactivity, and will illustrate this with examples from our own work on music genre exploration, which combines visualizations and interactive tools to help users understand and tune our exploration model.
2. Why do we need explainability?
• Model validation: avoid biases, unfairness or overfitting, detect issues in the training data, adhere to ethical/legal requirements
• Model debugging and improvement: improving the model fit, adversarial learning (fooling a model with ‘hacked’ inputs), reliability & robustness (sensitivity to small input changes)
• Knowledge discovery: explanations provide feedback to the Data Scientist or user that can result in new insights by revealing hidden underlying correlations/patterns
• Trust and technology acceptance: explanations might convince users to adopt the technology and give them a greater sense of control
3. Poll: What is a good explanation?
A: complete and accurate evidence for the decision
B: gives a single good reason why this decision was made
C: tells me what I need to get a different decision
4. What is important for explainability in ML?
• Accuracy: does the explanation predict unseen data? Is it as accurate as the model itself?
• Fidelity: does the explanation approximate the prediction of the model? Especially important for black-box models (local fidelity; see the sketch below).
• Consistency: same explanations for different models?
• Stability: similar explanations for similar instances?
• Comprehensibility: do humans get it? (see previous slide)
Some of these are hard to achieve with some models…
https://christophm.github.io/interpretable-ml-book/properties.html
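Fidelity in particular can be made concrete: train an interpretable surrogate to mimic the black-box model's predictions and measure their agreement. A minimal sketch of this idea (my own illustration; the dataset and models are arbitrary choices, not from the talk):

```python
# Sketch: measuring the global fidelity of an interpretable surrogate.
# Assumption for illustration: a random forest plays the black-box model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Fit a shallow (comprehensible) tree on the black box's *predictions*,
# not on the true labels: fidelity is agreement with the model,
# accuracy is agreement with the ground truth.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

fidelity = accuracy_score(black_box.predict(X), surrogate.predict(X))
accuracy = accuracy_score(y, surrogate.predict(X))
print(f"fidelity to model: {fidelity:.2f}, accuracy on labels: {accuracy:.2f}")
```

A high-fidelity but shallow surrogate trades some accuracy for comprehensibility, which is exactly the tension listed above.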
5. What is a good explanation (for humans)?
Confalonieri et al. (2020) & Molnar (2020), based on Miller:
• Contrastive: why was this prediction made instead of another?
• Selective: focus on a few important causes (not all features that contributed to the model).
• Social: should fit the mental model of the explainee / target audience, consider the social context, and fit their prior beliefs.
• Abnormalness: humans like rare causes (related to counterfactuals).
• (Truthfulness: less important for humans than selectiveness!)
https://christophm.github.io/interpretable-ml-book/explanation.html
6. Machine learning / AI interpretability
Some methods are inherently interpretable (glass-box or white-box models):
• Regression, decision trees, GAMs
• Some RecSys algorithms (content-based or classical CF)
Many others are not: black-box models
• Neural networks (CNN/RNN), random forests, matrix factorization, etc.
• These often require post-hoc explanations (leaving the model intact)
A further distinction can be made between:
• Model-specific methods (the explanation is specific to the ML technique)
• Model-agnostic methods (the explanation treats the ML model as a black box: it uses only the inputs/outputs)
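To make the glass-box vs. black-box contrast concrete, here is a minimal sketch (my own example, not from the talk): a logistic regression explains itself through its coefficients, something a random forest or neural network cannot offer directly.

```python
# Sketch: a glass-box model carries its own (global) explanation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# Standardized coefficients are directly readable as feature influences.
coefs = model.named_steps["logisticregression"].coef_[0]
top = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))[:5]
for name, w in top:
    print(f"{name}: {w:+.2f}")
# A black-box model has no such readable parameters; it needs post-hoc,
# often model-agnostic, explanation methods (LIME, SHAP, ...).
```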
7. Explanations can be global, component-based, or local
[Figure: a GAM as global explanation, SHAP dependence plots as component explanations, and SHAP local explanations]
Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning, Kaur et al., CHI 2020
Even Data Scientists do not get these visualizations…!
8. Global explanations (how does it work in general?)
How does the model perform on average for the dataset; an overall approximation of the (black-box) ML model?
• Feature importance ranks: permute/remove features and see how the model output changes to find feature importance
• Feature effects: effect of a specific feature on the outcome of the model: Partial Dependence Plots (marginal effects) or Accumulated Local Effect plots (conditional effects)
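Both global approaches above are available in scikit-learn. A minimal sketch (the dataset and model are my own choices for illustration):

```python
# Sketch: the two global explanation approaches named above.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Feature importance: permute each feature and measure the drop in score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, imp in ranked[:3]:
    print(f"{name}: {imp:.3f}")

# Feature effect: partial dependence plot (marginal effect) of one feature.
PartialDependenceDisplay.from_estimator(model, X, features=["bmi"])
plt.show()
```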
9. Local explanations: why do I get this prediction?
LIME (Local Interpretable Model-agnostic Explanations), an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable (surrogate) model.
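A minimal sketch with the lime package (pip install lime), set up to mirror the Christianity-vs-atheism text example on the next slides; the classifier and data choices are my assumptions, not necessarily those of the original demo:

```python
# Sketch: LIME explaining a text classifier's prediction locally.
from lime.lime_text import LimeTextExplainer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

cats = ["alt.atheism", "soc.religion.christian"]
train = fetch_20newsgroups(subset="train", categories=cats)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)

# LIME perturbs the document, queries the model, and fits a local
# linear surrogate; the surrogate's weights are the explanation.
explainer = LimeTextExplainer(class_names=cats)
exp = explainer.explain_instance(train.data[0], model.predict_proba,
                                 num_features=6)
print(exp.as_list())   # top words with their local contribution weights
```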
10. Local explanations that are model-agnostic…
By “explaining a prediction”, we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance’s components (e.g. words in text, patches in an image) and the model’s prediction.
Criteria:
• Interpretable: provide qualitative understanding between the input variables and the response
• Local fidelity: for an explanation to be meaningful, it must at least be locally faithful
• Model-agnostic: an explainer should be able to explain any model
11. LIME output: which algorithm works better?
Two algorithms with similar accuracy predicting whether the text below is about Christianity or atheism.
Poll: Which model should you trust more, 1 or 2?
12. Works very well, but…
Sentiment of the sentence “This is not bad”: LIME can show that the sentiment is detected correctly because of the conjunction of “not” and “bad”.
Same results for two very different models. But do you notice a difference?
Valence of the decision class: which is more understandable?
• Logistic regression on unigrams
• LSTM on sentence embeddings
Ribeiro et al. 2016, Model-Agnostic Interpretability of Machine Learning, arXiv:1606.05386v1
13. Improving understandability of feature contributions in model-agnostic explainable AI tools (CHI 2022)
Sophia Hadash, Martijn Willemsen, Chris Snijders, and Wijnand IJsselsteijn
Jheronimus Academy of Data Science
Human-Technology Interaction, TU/e
14. Visualizations of LIME (and SHAP) can be counterintuitive!
Prediction class: bad (ineligible for loan) (Data: credit-g)
Cognitively challenging due to (double) negations!
17. Empirical user study
⚫ 133 participants (61 male), recruited via a university database + convenience sampling
Factors:
⚫ Loan applications and music recommendations (within-subjects)
⚫ Framing: positive or negative (within-subjects)
⚫ Semantic labelling: no labels, “eligibility/like”, or “ineligibility/dislike” (between-subjects, to prevent carry-over learning effects)
Measurement: perceived understandability on a 4-pt Likert scale.
⚫ 6 trials per within-subjects condition, 24 per participant
19. Results
Negatively framed semantic labels do not improve understandability.
⚫ (e.g. “+5% ineligibility”)
⚫ Not even when compatible with the negative decision class…
21. Take away: do not forget the psychology!
Positive framing always works better than negative framing (even for negative decision classes).
• Requires that decision classes are inherently “positive” or “negative”
Use of semantic labelling can improve the understandability of the visualizations of interpretability tools.
• Reduces framing effects!
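As a design illustration of that take-away (my own sketch, with made-up feature names and contribution values, not the study's code or data): the same contributions can be rendered with negative framing toward the decision class, or re-framed positively with a semantic label.

```python
# Sketch: re-framing feature contributions with positive semantic labels.
# Feature names and contribution values are invented for illustration.
contributions = {
    "checking_status < 0": +0.05,
    "duration > 24": +0.03,
    "credit_history = critical": -0.02,
}

# Negative framing toward the predicted class "ineligible":
# the user must process a double negation ("+5% ineligibility").
for feat, c in contributions.items():
    print(f"{feat}: {c:+.0%} ineligibility")

# Positive framing with a semantic label, as the study recommends:
# flip the sign so every contribution speaks about the positive class.
for feat, c in contributions.items():
    print(f"{feat}: {-c:+.0%} eligibility")
```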
22. Drawbacks of post-hoc explanations
These tools still just provide a retrospective explanation of the outcome…
• Static; they lack contrastive, counterfactual insights…
Ben Shneiderman promoted prospective user interfaces:
• Interactive tools that show you what aspects influence and change the outcome of an AI
How would that work? It has already been done for decades!
23. How do we make explanations contrastive and selective?
How do we make sure they fit our mental models and beliefs?
Let’s make them interactive!
24. Interactive ML is not new…
Dudley (2018) and Amershi (2019) show that two decades of research have already looked at these issues in communities like IUI and CHI…
Example: Crayons, 2003
Fails & Olsen, Crayons, IUI (2003)
25. Traditional ML
Amershi et al. 2014: Power to the People
• The ML practitioner works with (domain) experts on feature selection / data representation
• Use ML, build predictions, go back to the expert for validation
• Long and slow cycle, big steps
• Exploration is mostly on the side of the ML / data scientist
26. Interactive ML
• The user directly interacts with the model
• Incremental but fast updates, small steps, low-cost trial & error
• Smaller cycles give a better understanding of what happens
• Can be done by low-expertise users
• Examples: recommender systems and tools like Crayons
Amershi et al. 2014: Power to the People
27. Interface elements of an IML system (Dudley 2018, sec. 4)
‘These elements represent distinct functionality that the interface must typically support to deliver a comprehensive IML workflow’
Not necessarily physically distinct: e.g., Crayons merges sample review and feedback assignment
29. Key solution principles according to Dudley (2018)
Exploit interactivity and promote rich interactions
• Interaction for understanding: many UX principles are hard to achieve in IML (e.g., direct manipulation principles)
• Make the most of the user: balance the effort and value of input, avoid repeated requests, provide retracing of steps and undo
Engage the user
• Provide feedback, show partial predictions, do not ask trivial labeling tasks
• This might encourage users to spend more time and improve the modeling
31. 18 guidelines
• UX design process
• Brings knowledge from many related fields together
• Goes back to earlier classical work: strongly founded in the mixed-initiative work of Horvitz (IUI 1999)
32. Two example applications of interactive AI / RecSys from my lab that I consider to be prospective user interfaces
33. Preparing for a marathon
Target finish times: not too fast, not too slow
Pacing (min/km) strategy: a constant ‘flat’ speed is associated with best performance
Heleen Rutjes
34. Prediction model for setting a challenging, yet realistic finish time
Model predictions are based on similar runners: if runner *sunglasses* has had similar past performances to runner *hat*, yet has a better Personal Best (PB), then runner *hat* can potentially achieve that too.
Approach: ‘case-based reasoning’ (CBR)
We asked coaches what aspects they would like to control:
- Select similar runners?
- Select the best races to serve as a case?
Research by Barry Smyth: http://medium.com/running-with-data/
35. Making the model interactive
Running coaches could indicate for every previous race how ‘representative’ they consider it. By setting the slider, the model prediction was continuously updated.
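A minimal sketch of how such a slider can feed into a case-based prediction (my own simplification for illustration; not the study's actual model or data):

```python
# Sketch: case-based finish-time prediction with coach-set
# representativeness weights. All numbers are illustrative.
def predict_finish_time(case_predictions, weights):
    """Weighted average of per-case predictions; each weight in [0, 1]
    is the coach's 'how representative is this race?' slider value."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one race must be representative")
    return sum(w * p for p, w in zip(case_predictions, weights)) / total

# Per-race predicted marathon times (minutes), e.g. derived from
# similar runners' improvements (case-based reasoning):
case_predictions = [212.0, 205.0, 231.0]
sliders = [1.0, 0.8, 0.0]   # coach rules out the third, unrepresentative race
print(predict_finish_time(case_predictions, sliders))  # updates as sliders move
```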
36. Model interactivity increased trust and acceptance
Acceptance: coaches were more inclined to accept a model that they could interact with.
Trust: model interactivity increased coaches’ perceived competence of the model.
“Without my adjustments the model did not make sense, but by eliminating the race from Eindhoven, we’re getting somewhere.” (Coach 53, familiar runner, interactive condition)
37. Coaches improved the accuracy of the model
Model accuracy improved through coaches’ interactions (mean percent error dropped from 3.14 to 2.33, p = 0.018).
What did the coaches adjust?
Systematic adjustments: more recent races were indicated as more representative (p < 0.001).
‘Anecdotal’ adjustments: based on knowledge of the specific runner, running in general, environmental circumstances, etc. Even when working with unfamiliar runners:
“There is clearly something going on with this lady. Maybe she stopped training, or she has a persistent injury?” (Coach 45, unfamiliar runner, non-interactive condition)
39. How to better support users in exploring a new music genre?
[Millecamp, M., Htun, N. N., Jin, Y., & Verbert, K. 2018]
[Bostandjiev, S., O’Donovan, J., & Höllerer, T. 2012]
[Andjelkovic, I., et al. 2019], [He, C., Parra, D., & Verbert, K. 2016]
40. Simple bar chart visualization to explain recommendations
[Millecamp et al. 2019]
Bar charts:
• Easy to understand
• Not very informative: they present only the averaged preferences
41. More complex contour plot visualization
Contour plots:
1) Show the relation between the recommendations, the user’s current preferences, and the new genre
2) Show the intensity of the user’s preferences
A bit hard to understand
Mood control
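To illustrate how such a contour visualization can be constructed (my own sketch; the choice of audio features and the synthetic data are assumptions, not the study's implementation):

```python
# Sketch: contour plot of a user's preferences vs. a new genre in a
# 2-D audio-feature space (energy x valence are assumed features).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
user_tracks = rng.normal([0.4, 0.3], 0.10, size=(200, 2))   # synthetic data
genre_tracks = rng.normal([0.7, 0.6], 0.12, size=(200, 2))

xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = np.vstack([xx.ravel(), yy.ravel()])

# Kernel density estimates: contour height shows preference intensity.
for tracks, cmap in [(user_tracks, "Blues"), (genre_tracks, "Oranges")]:
    density = gaussian_kde(tracks.T)(grid).reshape(xx.shape)
    plt.contour(xx, yy, density, cmap=cmap)

plt.xlabel("energy"); plt.ylabel("valence")
plt.show()
```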
42. Contour plot + mood control (most helpful?)
Easily see how the recommendation changes
44. Research questions
RQ1: How do different types of visualizations (bar charts / contour plots) influence the perceived helpfulness for new music genre exploration?
RQ2: How does mood control improve the perceived helpfulness for new music genre exploration?
45. Study design
2×2 mixed factorial design:
• Mood control: between-subjects
• Visualization: within-subjects
Interactive Music Genre Exploration with Visualization and Mood Control
46. Measurements
• Subjective measures: post-task questionnaires (perceived helpfulness, perceived control, perceived informativeness and understandability)
• Objective measures: user interactions with the system
• Musical sophistication (active engagement & emotional engagement)
• Participants: mainly university students
• 102 valid responses
[Figure: genre selection frequencies, i.e. the genres participants wanted to explore]
47. Which is more helpful?
Contour plot (vs. bar charts):
• More helpful
• Total effect: β = .378, se = .082, p < .001
Mood control (vs. no control):
• Seems to be more helpful
• Total effect: β = .238, se = .123, p = .053 (marginally significant)
Contour plot + control:
• More helpful
• Total effect: β = .242, se = .123, p = .049
48. What we have found…
Good visualization is key for understandability and explainability.
The contour plot is perceived as more helpful than the bar chart:
• More informative, thus more understandable & helpful
• Better mental model?
Interaction only helps with a good mental model / understanding.
Mood control by itself does not make the system more helpful:
• Paired with the contour plot, it benefits perceived helpfulness, mostly due to increased informativeness
49. Further work on genre exploration
RecSys 2021: the role of default settings on genre selection and exploration:
• Tradeoff slider: from genre-representative to more personalized songs (see the sketch below)
• Defaults had a strong effect on how far users explored…
RecSys 2022 (just accepted): a longitudinal study in which participants used the same tool for 4 weeks:
• Default effects fade over the weeks
• Users find the tool helpful / keep exploring after 4 weeks
• Some actual change in music profile after 6 weeks!
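The tradeoff slider mentioned above can be thought of as a weight that blends genre typicality with personal fit; a schematic sketch (my formulation, not the paper's exact scoring function):

```python
# Sketch: slider-controlled tradeoff between genre-representative and
# personalized songs. Scores and songs are schematic assumptions.
def blend_score(genre_typicality, personal_fit, slider):
    """slider = 0 -> purely genre-representative recommendations,
    slider = 1 -> purely personalized recommendations."""
    return (1 - slider) * genre_typicality + slider * personal_fit

# (genre_typicality, personal_fit) per candidate song:
songs = {"song_a": (0.9, 0.2), "song_b": (0.5, 0.8)}

# Different default slider positions yield different rankings, which is
# one way a default can anchor how far users explore.
for default in (0.25, 0.75):
    ranked = sorted(songs, key=lambda s: -blend_score(*songs[s], default))
    print(default, ranked)
```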
50. Conclusions
Two separate worlds:
• interactive Machine Learning: interpretability for data scientists
• human-AI interaction work focused on the user at CHI, UMAP, IUI (and RecSys)
We should learn from each other and bring the two more together!
Human-AI interaction requires a solid understanding of mental models, cognitive processes and biases, visualization guidelines, and user experience research!