2. Vision
Harness the relative strengths of humans and machine learning models.
[Diagram: Human + Machine Learning Models; image: http://blogs.teradata.com/]
3. Research objectives
Develop machine learning models inspired by how humans think that can…
7. Research objectives
Develop machine learning models inspired by how humans think that can…
1. Infer human team decisions from team planning conversation (infer decisions of humans)
2. Communication from machine to human: provide intuitive explanations (make sense to humans)
3. Communication from human to machine: incorporate feedback (interact with humans)
8. Road map
1. Infer human team decisions from team planning conversation (infer decisions of humans)
2. Communication from machine to human: provide intuitive explanations (make sense to humans)
3. Communication from human to machine: incorporate feedback (interact with humans)
10. Mirror the way humans think
• Humans' tactical decisions are based on exemplar-based reasoning (matching and prototyping) [Cohen 96, Newell 72]
• Skilled firefighters use recognition-primed decision making, in which a situation is matched to typical cases [Klein 89]
• Machines can better support people's decision-making by representing data in the same way
11. Case-based reasoning and interpretable models
Case-based reasoning
• Applied to various applications thanks to its intuitive power [Aamodt 94, Slade 91, Bekkerman 06]
Limitations
• Always requires labels (supervised)
• Does not scale to complex problems
• Does not leverage global patterns in the data
Interpretable models
• Decision trees [De`ath 00]
• Sparse linear classifiers [Tibshirani 96, Ustun 14]
• Prototype-based methods [Graf 09]
Limitations
• Sparsity is not enough [Freitas 14]
• Models are linear or supervised
12. Our approach: Bayesian Case Model (BCM)
• Bayesian generative models + case-based reasoning = Bayesian Case Model (BCM)
• Leverages the power of examples (prototypes) and subspaces (hot features) to explain machine learning results
• Explains complicated concepts using examples
[Kim, Rudin, Shah NIPS 2014]
14. Bayesian Case Model (BCM)
• A general framework for Bayesian case-based reasoning
• Joint inference on prototypes, subspaces, and cluster labels
[Diagram: Clusters A, B, C, each with its prototype, subspace, and cluster labels]
15. Explanations provided by Bayesian Case Model (BCM)
• Cluster A, prototype "Taco": subspace salsa, sour cream, avocado; remaining prototype ingredients salt, pepper, taco shell, lettuce, oil
• Cluster B, prototype "Basic crepe": subspace flour, egg; remaining ingredients water, salt, milk, butter
• Cluster C, prototype "Chocolate berry tart": subspace chocolate, strawberry; remaining ingredients pie crust, whipping cream, kirsch, almonds
16. Bayesian Case Model (BCM)
• A general framework for Bayesian case-based reasoning
• Joint inference on cluster labels, prototypes, and subspaces
• Prototype: the quintessential observation that best represents the cluster
• Subspace: the set of important features in characterizing clusters
• Two parts: 1. clustering; 2. learning the explanation
[Example: Cluster A is explained by the prototype "Taco" with subspace salsa, sour cream, avocado]
17. Bayesian Case Model (BCM): 1. Clustering part
• Admixture model for modeling the underlying distributions
• Cluster labels assign each feature of a data point to a cluster, e.g. mexican_crepe = [A, B, A] over Clusters A, B, C
• "It is a crepe, since it has flour and egg. It is inspired by Mexican food, because it has avocado, salsa and sour cream."
18. Bayesian Case Model (BCM): 1. Clustering part
• Admixture model for modeling the underlying distributions
• e.g. chocolate_crepe = [B, C, C] over Clusters A, B, C
• "It is a crepe, since it has flour and egg. It is a sweet crepe that is like a chocolate and berry dessert."
19. Bayesian Case Model (BCM): 1. Clustering part
• The cluster distribution of each data point, combined with supervised classification methods, can be used to evaluate clustering performance [1]
• The concentration hyperparameter controls how many different cluster labels appear within one data point
[1] D. Blei, A. Ng, M. Jordan 2003
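The clustering part above follows a standard admixture (LDA-style) generative process. A minimal sketch, where the Dirichlet parameter `alpha` plays the role of the concentration hyperparameter on the slide; the toy probabilities and sizes are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_admixture(n_points, n_features, n_clusters, alpha, cluster_feature_probs):
    """Sample data from a simple admixture model, as in BCM's clustering part.

    alpha is the concentration parameter: small alpha means each data point
    draws most features from a single cluster; large alpha mixes cluster
    labels within one point."""
    labels, data = [], []
    for _ in range(n_points):
        pi = rng.dirichlet([alpha] * n_clusters)            # per-point cluster distribution
        z = rng.choice(n_clusters, size=n_features, p=pi)   # per-feature cluster labels
        x = np.array([rng.choice(len(cluster_feature_probs[zj]),
                                 p=cluster_feature_probs[zj]) for zj in z])
        labels.append(z)
        data.append(x)
    return np.array(data), np.array(labels)

# Two toy clusters over 3 feature values; alpha=0.1 keeps points nearly pure.
probs = [np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8])]
X, Z = generate_admixture(n_points=5, n_features=4, n_clusters=2, alpha=0.1,
                          cluster_feature_probs=probs)
```

Raising `alpha` makes the sampled cluster-label vectors look like the mexican_crepe = [A, B, A] example, with several clusters inside one data point.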
20. Bayesian Case Model (BCM): 2. Learning explanation part
• Each cluster is characterized by a prototype and subspaces
• Subspaces are binary variables: 1 for important features
21. Bayesian Case Model (BCM): 2. Learning explanation part
• Prototype: the quintessential observation that best represents the cluster
• A prototype is an actual data point that exists in the dataset
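Because a prototype must be an actual data point, it can be pictured as a selection among cluster members. A minimal sketch that scores members by feature-wise agreement; BCM instead samples the prototype during inference, so this scoring rule is purely illustrative:

```python
import numpy as np

def pick_prototype(cluster_points):
    """Return the index of the member that best represents the cluster,
    scored by mean feature-wise agreement with all members (0/1 match)."""
    X = np.asarray(cluster_points)
    # agreement[i] = average fraction of features on which point i matches the others
    agreement = (X[:, None, :] == X[None, :, :]).mean(axis=(1, 2))
    return int(np.argmax(agreement))

cluster = [[1, 0, 2], [1, 0, 1], [1, 0, 2], [0, 2, 2]]
proto_idx = pick_prototype(cluster)  # the most representative actual data point
```

Note the prototype is always one of the inputs, never an averaged "centroid" that may not exist in the dataset.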
22. Bayesian Case Model (BCM): 2. Learning explanation part
• Subspace: the set of important features in characterizing clusters
• Subspaces are binary variables: 1 for important features
23. Bayesian Case Model (BCM): 2. Learning explanation part
• Subspace: the set of important features in characterizing clusters
• Any similarity measure can be used; for example, with a 0/1 loss, feature j of cluster s is an important feature (i.e., in the subspace) when the value of feature j is identical to the value of the prototype of cluster s
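The 0/1-loss criterion above can be sketched directly. The threshold below is an illustrative assumption; BCM samples the binary subspace indicators during inference rather than thresholding:

```python
import numpy as np

def subspace_indicator(cluster_points, prototype, threshold=0.9):
    """Binary subspace vector omega: 1 for important features.

    A feature j is marked important when cluster members overwhelmingly agree
    with the prototype's value on j (a 0/1 loss). Any similarity measure could
    replace the equality check."""
    X = np.asarray(cluster_points)
    agreement = (X == np.asarray(prototype)).mean(axis=0)  # per-feature match rate
    return (agreement >= threshold).astype(int)

cluster = [[1, 0, 2], [1, 0, 1], [1, 0, 2], [1, 2, 2]]
omega = subspace_indicator(cluster, prototype=[1, 0, 2])
# feature 0 matches the prototype in every member; features 1 and 2 only in 3 of 4
```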
24. Results
Challenges for interpretable models:
1. Do the learned prototypes and subspaces make sense?
2. Are we sacrificing performance for interpretability?
3. Do the learned prototypes and subspaces help humans' understanding?
25. BCM on recipe data
1. Do the learned prototypes and subspaces make sense?
• Unsupervised clustering on a subset of recipe data
• Data from the computer cooking contest: liris/cnrs.fr/ccc/ccc2014
26. BCM on digit data
1. Do the learned prototypes and subspaces make sense?
• Data: http://www.cs.nyu.edu/~roweis/data.html
27. BCM on digit data
1. Do the learned prototypes and subspaces make sense?
[Figure: learned cluster D across Gibbs sampling iterations]
28. Maintain accuracy
2. Are we sacrificing anything for interpretability?
[Figure: sensitivity analysis of BCM accuracy on the handwritten-digit dataset and the 20 Newsgroups dataset]
29. Joint inference on prototypes, subspaces, and cluster labels is the key
2. Are we sacrificing anything for interpretability?
[Figure: level sets of the posterior distribution. Given one solution that clusters the data well and another solution that clusters the data equally well but has better interpretability, BCM gives the higher score to the second.]
30. Collapsed Gibbs sampling for inference
• Observed to converge quickly in admixture models
• Integrating out the per-point cluster distributions and per-cluster feature distributions for efficient inference
[Kim, Rudin, Shah NIPS 2014]
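A minimal sketch of collapsed Gibbs sampling for the admixture clustering part, with the mixture and feature distributions integrated out so only the per-feature cluster labels are resampled. BCM's full sweep additionally resamples prototypes and subspaces; the hyperparameters and update here are the generic LDA-style ones, shown for illustration:

```python
import numpy as np

def collapsed_gibbs(X, n_clusters, alpha=1.0, beta=1.0, n_iters=50, seed=0):
    """Collapsed Gibbs sampler for an admixture over discrete features.

    Only the count tables are tracked; the per-point cluster distributions
    and per-cluster feature distributions are integrated out analytically."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n, f = X.shape
    n_vals = X.max() + 1
    z = rng.integers(n_clusters, size=(n, f))               # per-feature cluster labels
    nz = np.zeros((n, n_clusters))                          # point-cluster counts
    nv = np.zeros((n_clusters, n_vals))                     # cluster-value counts
    ns = np.zeros(n_clusters)                               # cluster totals
    for i in range(n):
        for j in range(f):
            nz[i, z[i, j]] += 1; nv[z[i, j], X[i, j]] += 1; ns[z[i, j]] += 1
    for _ in range(n_iters):
        for i in range(n):
            for j in range(f):
                s, v = z[i, j], X[i, j]
                nz[i, s] -= 1; nv[s, v] -= 1; ns[s] -= 1    # remove current label
                p = (nz[i] + alpha) * (nv[:, v] + beta) / (ns + n_vals * beta)
                s = rng.choice(n_clusters, p=p / p.sum())   # resample from the conditional
                z[i, j] = s
                nz[i, s] += 1; nv[s, v] += 1; ns[s] += 1
    return z

z = collapsed_gibbs([[0, 0, 0], [0, 0, 0], [2, 2, 2], [2, 2, 2]], n_clusters=2)
```

Integrating out the continuous parameters is what makes each step a cheap categorical draw from count tables, which is why this family of samplers converges quickly on admixture models.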
31. Does the model make sense to humans?
Objective measure of human understanding: accuracy of the human classifier
• The participant's task is to assign the ingredients of a specific dish (a new data point to be classified) to a cluster
• Each cluster is explained using either BCM or LDA
32. Does the model make sense to humans?
Objective measure of human understanding: accuracy of the human classifier
• 384 classification questions asked of 24 people
• Statistically significantly better performance with BCM (85.9% vs. 71.3%)
• Clusters were explained using either 1. BCM (ingredients of the prototype recipe) or 2. LDA (representative ingredients of each cluster)
[Kim, Rudin, Shah NIPS 2014]
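The reported gap (85.9% vs. 71.3%) can be checked for significance with a standard two-proportion z-test. The per-condition counts below are hypothetical placeholders chosen only to match the stated accuracies, since the slide reports 384 questions total but not the split; the slide does not say which test was used:

```python
from math import sqrt, erf

def two_proportion_z_pvalue(correct1, n1, correct2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # standard normal CDF via the error function; two-sided tail probability
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# HYPOTHETICAL counts: 165/192 = 85.9% and 137/192 = 71.3%; substitute the
# real per-condition counts to reproduce the analysis.
p_value = two_proportion_z_pvalue(165, 192, 137, 192)
```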
33. Road map
1. Infer human team decisions from team planning conversation (infer decisions of humans)
2. Communication from machine to human: provide intuitive explanations (make sense to humans)
3. Communication from human to machine: incorporate feedback (interact with humans)
37. Related work on interactive machine learning
• Interact via multiple model parameter settings [Patel 10, Amershi 15]
• Design smart interfaces [Amershi 11] and visualizations [Chaney 12, Gou 03]
• Interact via a simplified medium of interaction [Kapoor 10, Ware 01]
• Our medium of interaction: prototypes and subspaces!
38. Interactive BCM (iBCM)
[Graphical models: BCM vs. iBCM. Double-circled nodes represent interacted latent variables: nodes that receive information both from user feedback and from the data points.]
40. Interactive BCM (iBCM): internal mechanism
• Key: balance between what the data indicates and what makes most sense to the user
• Our approach: decompose the Gibbs sampling steps to
  1) adjust feedback propagation depending on the user's confidence
  2) accelerate inference by rearranging latent variables
• Cycle: 1. listen to users; 2. propagate user feedback to accelerate inference; 3. listen to data
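One way to picture the balance between what the data indicates and what makes sense to the user is a confidence-weighted blend inside a single sampling step. The geometric-mixture rule below is an illustrative assumption, not the published iBCM update:

```python
import numpy as np

def feedback_weighted_sample(data_probs, user_probs, confidence, rng):
    """One sketch of a decomposed Gibbs step for an interacted latent variable.

    Blends the data-driven conditional with the user's feedback distribution,
    weighted by the user's confidence in [0, 1]: confidence=0 ignores feedback
    (plain BCM), confidence=1 follows the user exactly."""
    data_probs = np.asarray(data_probs, dtype=float)
    user_probs = np.asarray(user_probs, dtype=float)
    blended = data_probs ** (1 - confidence) * user_probs ** confidence
    blended /= blended.sum()                     # renormalize the mixture
    return rng.choice(len(blended), p=blended), blended

rng = np.random.default_rng(1)
# The data prefers cluster 0, but a confident user prefers cluster 1.
idx, probs = feedback_weighted_sample([0.7, 0.3], [0.05, 0.95], confidence=0.9, rng=rng)
```

With high confidence the blended distribution tilts toward the user's choice while still letting strong evidence from the data push back, which matches the "listen to users, then listen to data" cycle above.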
41. User's workflow with iBCM (abstract domain)
• Click to change a feature's value
• Click to promote any item to be the prototype
42. Experiment procedure
1. Subjects are asked how they want to group items
2. Subjects view results from BCM (essentially one of the optimal clusterings)
3. Subjects indicate how well the results match their preferred clustering
4. Subjects interact with iBCM
5. Subjects indicate how well the results match what they want
24 participants, 192 questions
43. Experiment results
• 24 participants, 192 questions
• Participants agreed more strongly that the final clusters matched their preferences compared to the initial clusters (Wilcoxon signed rank test)
44. iBCM for introductory programming education
• Why education?
  • Teachers' current workflow for creating a grading rubric: randomly pick 4-5 assignments and do "hodgepodge grading" [Cross 99]
  • Understanding the variation across submissions is important for providing appropriate, tailored feedback to students [Basu 13, Huang 13]
• What are the challenges?
  • Extracting the right features: OverCode [Glassman 15]
45. iBCM + OverCode system
• Submissions from MIT introductory Python classes
47. iBCM experiment with domain experts
• Task: explore the full spectrum of students' submissions and write down a "discovery list" for a recitation
• Conditions compared: a baseline interface ("Click here to get a new grouping") vs. iBCM
49. Experiment with domain experts: results
• 48 problems explored by 12 subjects who had previously taught an introductory Python class
• Compared to BCM, participants agreed more strongly (p < 0.001, Wilcoxon signed rank test) that with iBCM they:
  • were more satisfied
  • better explored the full spectrum of students' submissions
  • better identified important features to expand the discovery list
  • found the important features and prototypes useful
50. Experiment with domain experts: results (continued)
• Participant quotes:
  • "[iBCM enabled me to] go in depth as to how students could do"
  • "[iBCM] is useful with large datasets where brute-force would not be practical."
51. Summary
• Inspiration: how humans make decisions [Kim, Chacha, Shah AAAI 13] [Kim, Chacha, Shah JAIR 15]
• Communication from machine to human: provide intuitive explanations (make sense to humans)
  • Approach: case-based Bayesian model
  • Results: provided intuitive explanations while maintaining performance [Kim, Rudin, Shah NIPS 2014]
• Communication from human to machine: incorporate feedback (interact with humans)
  • Approach: enable interaction by decomposing the sampling inference steps
  • Results: implemented and validated the approach in the education domain [Kim, Glassman, Johnson, Shah submitted*] [Kim, Patel, Rostamizadeh, Shah AAAI 2015]
52. Next steps
• Interpretability for data exploration: visualization
• Domain-specific interpretability: learning features that distinguish clusters
• Interactive machine learning for debugging models or hyperparameter exploration
[Example of a misclassified data point: Doc id #24, predicted "politics", true label "medicine"]
[Kim, Patel, Rostamizadeh, Shah AAAI 2015] [Kim, Doshi-Velez, Shah NIPS 2015]
53. Next steps at AI2
• Extend interpretability to initially uninterpretable features (neural nets)
[Example: a 4th-grade science exam question]
54. Q&A
[Summary recap: communication from machine to human (case-based Bayesian model, intuitive explanations while maintaining performance) and from human to machine (interaction by decomposing the sampling inference steps, validated in the education domain)]
[Kim, Chacha, Shah AAAI 13] [Kim, Chacha, Shah JAIR 15] [Kim, Rudin, Shah NIPS 2014] [Kim, Glassman, Johnson, Shah submitted*] [Kim, Patel, Rostamizadeh, Shah AAAI 2015] [Kim, Doshi-Velez, Shah NIPS 2015]
AI2 is hiring research interns any time of the year. Shoot me an email if interested! beenk@allenai.org