Improved interpretability for Computer-Aided Assessment of Retinopathy of Prematurity
1. 1
Improved Interpretability for Computer-Aided Severity
Assessment of Retinopathy of Prematurity
M. Graziani, J. M. Brown, V. Andrearczyk, V. Yildiz, J. P. Campbell, D. Erdogmus, S.
Ioannidis, M. F. Chiang, J. Kalpathy-Cramer, and H. Müller
2. 2
What is Retinopathy Of Prematurity (ROP)?
‣ Abnormal growth of blood vessels in the retina
‣ 14,000 to 16,000 premature infants per year in the U.S.1; about 10% require prompt
treatment to avoid retinal detachment and blindness (incidence is growing)
‣ Staging from 1 to 5; detection of pre-plus or plus disease
‣ Detection of plus disease shows high disagreement among experts
1. U.S National Eye Institute (nei.nih.gov/health/rop/rop)
Fig 1. Stage 3: visual examples of Normal, Pre-plus and Plus cases
3. 3
Detection of Plus: a Machine Learning approach
Four steps:
‣ Vessel Segmentation
‣ Centerline Tracing
‣ Feature Extraction
‣ Classification
* Rate of changing velocity between points with respect to the curve length between points
Table 1. Feature types and descriptions
We extract 11 types of handcrafted features from the images, whose importance to the evaluation of the
disease was evaluated by Ataer-Cansizoglu et al.2 The impact of the feature choice on the diagnosis was
thoroughly investigated in the literature, and the selected method constitutes a reference standard for
inter-expert agreement.2 For each type of feature, 8 traditional statistics (such as minimum, maximum, mean,
median and second and third moments) and 5 Gaussian Mixture Model (GMM) statistics are extracted, for
a total of 143 handcrafted features (more details in Appendix 1). Such features are extracted from the
automated vessel segmentations and express curvature, tortuosity and dilation of retinal arteries and veins
(details reported in Table 1). The features in Table 1 are first computed independently for each vessel in the
image. The "vesselness" of the whole retinal sample is then summarized by standard statistics such as the mean and
median of the per-vessel features. A ranking of the features is computed on the basis of their Gini coefficient for
random forest classification of normal vs. pre-plus or worse on 100 random train-test splits (with replacement)
of the data. The retaining criterion used for this analysis identified a set of six measures that covered a wide set
of clinically interpretable features, discarding measures with a frequency of appearance lower than 10% in the
ranking. The retained measures were: curvature mean, curvature median, avg point diameter mean, avg segment
diameter mean, cti mean and cti median. Notwithstanding, the same analysis can be repeated with a different
criterion or with the exhaustive analysis of all the 143 features.
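The 13 statistics per feature type (8 traditional plus 5 GMM) can be sketched as below, using NumPy and scikit-learn. This is a minimal illustration: the paper lists minimum, maximum, mean, median and the second and third moments among the traditional statistics, but the exact remaining statistics and the number of GMM components are assumptions here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def summary_statistics(values, n_gmm_components=2, seed=0):
    """Summarize one feature type over all vessels of an image with
    8 traditional statistics plus 5 GMM statistics (13 per feature type;
    11 feature types x 13 statistics = 143 handcrafted features)."""
    v = np.asarray(values, dtype=float)
    traditional = [
        float(v.min()), float(v.max()), float(v.mean()), float(np.median(v)),
        float(v.std()),                       # spread (second moment)
        float(((v - v.mean()) ** 3).mean()),  # third central moment
        float(np.percentile(v, 25)), float(np.percentile(v, 75)),
    ]
    # Fit a small GMM to the per-vessel values and summarize its parameters.
    gmm = GaussianMixture(n_components=n_gmm_components, random_state=seed)
    gmm.fit(v.reshape(-1, 1))
    gmm_stats = [
        float(gmm.means_.min()), float(gmm.means_.max()),
        float(gmm.covariances_.min()), float(gmm.covariances_.max()),
        float(gmm.weights_.max()),
    ]
    return traditional + gmm_stats
```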
Feature                              Description               Clinical interpretation
curvature                            κ(s)                      rate of direction change
avg segment diameter                 #pixels/Lc(x)             global dilation
avg point diameter                   Wn(x)                     absolute dilation
Cumulative Tortuosity Index (CTI)    cti(x) = Lc(x)/Lx(x)      curving, curling, twisting rate
Table 1: Handcrafted feature description and clinical interpretation. κ(s) describes the rate of changing velocity
between points with respect to the rate of changing curve length between points. Lc and Lx denote
respectively curve and chord length. Wn denotes the width of the vessel in the normal direction.
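The Table 1 definitions translate directly into code. The sketch below computes the Cumulative Tortuosity Index of a single vessel from its traced centerline, assuming the centerline is already available as an ordered array of points (the tracing step itself is not shown).

```python
import numpy as np

def cumulative_tortuosity_index(centerline):
    """CTI of one vessel per Table 1: curve length Lc over chord length Lx,
    cti(x) = Lc(x) / Lx(x). `centerline` is an (N, 2) array of ordered
    centerline points produced by the tracing step."""
    pts = np.asarray(centerline, dtype=float)
    curve_length = float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))  # Lc
    chord_length = float(np.linalg.norm(pts[-1] - pts[0]))                      # Lx
    return curve_length / chord_length

straight = [(0, 0), (0, 1), (0, 2)]   # CTI = 1: curve equals chord
bent = [(0, 0), (1, 1), (0, 2)]       # CTI > 1: the vessel curls
```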
Fig 2. Classification pipeline of handcrafted features: segmentation → tracing → feature engineering → classification (the figure annotates vessel width, curve length and chord length)
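The Gini-based feature ranking over 100 resampled splits described above can be sketched as follows, on synthetic data standing in for the 143-feature matrix. The 10% frequency threshold follows the text; the top-10 cutoff per split and all data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Synthetic stand-in: rows would be retinal images described by the 143
# handcrafted features, labels normal vs. pre-plus-or-worse. Features 0
# and 1 carry the signal here by construction.
X = rng.normal(size=(150, 143))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=150) > 0).astype(int)

n_splits, top_k = 100, 10
counts = np.zeros(X.shape[1])   # appearances of each feature in the top-k
for split in range(n_splits):
    idx = rng.choice(len(X), size=len(X), replace=True)  # resample with replacement
    rf = RandomForestClassifier(n_estimators=25, random_state=split)
    rf.fit(X[idx], y[idx])
    # Rank features by Gini-based importance and count this split's top-k.
    counts[np.argsort(rf.feature_importances_)[-top_k:]] += 1

# Retain features appearing in at least 10% of the rankings.
retained = np.flatnonzero(counts >= 0.10 * n_splits)
```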
4. 4
[Figure: U-Net architecture diagram (Ronneberger et al.); blue boxes are multi-channel feature maps with channel counts from 64 to 1024, white boxes are copied feature maps, arrows denote conv 3x3 + ReLU, copy and crop, max pool 2x2, up-conv 2x2 and conv 1x1 operations]
Detection of Plus: a Deep Learning approach
Inception V1
Normal
Preplus
Plus
UNet
Performance significantly higher than non-experts!
Fig 3. End-to-end classification with Deep Learning
[Brown J., et al., 2018] SPIE 2018
5. 5
Detection of Plus: a Deep Learning approach
>5K images
3024 training
(1084 normal; 1074 pre-plus; 1080 plus)
965 validation
(817 normal; 148 pre-plus; 20 plus)
Inception V1
Normal
Preplus
Plus
UNet
Performance comparable to experts and
significantly higher than non-experts
TRUST ?
INTERPRET ?
EXPLAIN ?
6. 6
Detection of Plus: a Deep Learning approach
>5K images
3024 training
(1084 normal; 1074 pre-plus; 1080 plus)
965 validation
(817 normal; 148 pre-plus; 20 plus)
Inception V1
Normal
Preplus
Plus
UNet
Performance comparable to experts and
significantly higher than non-experts
TRUST ?
INTERPRET ?
EXPLAIN ?
If only we could make sure that the network
is looking at the same things that we look at…
7. Can we relate hand-crafted visual
features to DL features?
7
8. 8
Interpretability with Concept Activation Vectors
Classification in the
activation space
Relevance scores
for each concept
Select concept and images
1
[Kim B. et al., 2018] ICML 2018
Fig 4. Credits: Testing with Concept Activation Vectors, Kim B. et al., 2018
9. 9
Interpretability with Concept Activation Vectors
Classification in the
activation space
Relevance scores
for each concept
Select concept and images
1
2
[Kim B. et al., 2018] ICML 2018
Fig 4. Credits: Testing with Concept Activation Vectors, Kim B. et al., 2018
10. 10
Interpretability with Concept Activation Vectors
Classification in the
activation space
Relevance scores
for each concept
Select concept and images
1
2
3
[Kim B. et al., 2018] ICML 2018
Fig 4. Credits: Testing with Concept Activation Vectors, Kim B. et al., 2018
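The three TCAV steps on these slides (select concept images, classify in the activation space, compute relevance scores) can be sketched with a toy example: a linear classifier separating concept activations from random ones yields the CAV, and the TCAV score is the fraction of inputs whose directional derivative along the CAV is positive. Synthetic activations and a linear toy logit stand in for a real layer and network here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Steps 1-2: activations of concept images vs. random counterexamples
# (synthetic 10-D vectors standing in for a chosen layer's outputs).
concept_acts = rng.normal(loc=1.0, size=(50, 10))
random_acts = rng.normal(loc=-1.0, size=(50, 10))
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 50 + [0] * 50)

# The CAV is the normal to the linear decision boundary separating
# concept activations from random ones.
clf = LogisticRegression().fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Step 3: TCAV score = fraction of class inputs whose directional
# derivative along the CAV is positive. A linear toy "logit" makes the
# gradient constant; a real network's gradient comes from backprop.
def class_logit_gradient(activation, w=np.ones(10)):
    return w  # gradient of the linear logit w . a is w everywhere

test_acts = rng.normal(size=(20, 10))
sensitivities = np.array([class_logit_gradient(a) @ cav for a in test_acts])
tcav_score = float((sensitivities > 0).mean())
```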
11. 11
Concept measures for continuous features
‣ Medical applications often rely on continuous measures, which can be
related to a clinical interpretation
‣ Such measures are often used to compute hand-crafted visual features.
‣ Regression Concept Vectors extend TCAV to continuous concept
measures [Graziani et al., 2018]
[Graziani et al., 2018] iMIMIC at MICCAI 2018
Table 1. Feature types and clinical interpretation
12. 12
Regression Concept Vectors (RCVs)
Main steps:
‣ Selection of concept measures (set of images, annotations)
‣ Linear Least Squares (LLS) regression of the concept measure given the activation vectors, replacing TCAV's classification in the activation space
‣ Computation of sensitivity and relevance scores
Fig 5. Interpretation of the model of Brown et al. (SPIE 2018) with RCVs: activations of the trained network for 'normal', 'pre-plus' and 'plus' cases regressed against the concept measure
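The steps above can be sketched on synthetic data: the concept measure is regressed on the activation vectors with linear least squares, and the normalized regression coefficients give the direction of greatest increase of the measure. In practice the activations would come from a chosen layer of the trained Inception network and the measure would be one of the retained concepts (e.g. cti mean); everything below is a stand-in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins: 100 images, 64-D activations at the chosen layer,
# and one continuous concept measure per image.
activations = rng.normal(size=(100, 64))
true_direction = rng.normal(size=64)
concept_measure = activations @ true_direction + 0.1 * rng.normal(size=100)

# The RCV is the direction of greatest increase of the concept measure in
# activation space: the normalized coefficients of a linear least squares
# (LLS) regression of the measure on the activations.
reg = LinearRegression().fit(activations, concept_measure)
rcv = reg.coef_ / np.linalg.norm(reg.coef_)
r_squared = reg.score(activations, concept_measure)  # quality of the fit
```

The determination coefficient R² of this regression is exactly the quantity compared between normal and plus inputs on the later slides.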
13. 13
Sensitivity and relevance
‣ Sensitivity scores for individual explanations: the directional derivative of the decision function over the RCV direction
‣ Bidirectional relevance score Br for global explanations: combines the regression determination coefficient with the mean and standard deviation of the sensitivity scores
Fig 6. Directional derivative of the decision function over the RCV direction
More details can be found in [Graziani et al., 2018] iMIMIC at MICCAI 2018
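The per-input sensitivity score reduces to a projection. A sketch, assuming the gradient of the decision function with respect to the layer activations has already been obtained (e.g. by backpropagation); the vectors below are toy values:

```python
import numpy as np

def sensitivity(gradient, rcv):
    """Directional derivative of the decision function along the RCV:
    the projection of the class-score gradient (taken with respect to
    the layer activations) onto the unit RCV direction."""
    rcv = np.asarray(rcv, dtype=float)
    rcv = rcv / np.linalg.norm(rcv)
    return float(np.dot(gradient, rcv))

# Toy example: for a linear decision function f(a) = w . a the gradient
# is w everywhere. A positive score means moving along the concept
# direction increases the class score.
w = np.array([2.0, -1.0, 0.5])
rcv_direction = np.array([1.0, 0.0, 0.0])
s = sensitivity(w, rcv_direction)
```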
14. 14
Regression is better for plus inputs
Figure 7. Comparison of the R² for inputs of class normal (left) vs plus (right)
15. 15
Sensitivity is negative for normal inputs
[Figure: raw and segmented fundus images for two normal cases, with individual relevance scores (scale -1 to 1) for cti median, cti mean, curvature median, curvature mean, avg point diameter mean and avg segment diameter median, shown alongside the original concept measures and the network's class probabilities (GT: normal; prediction: normal)]
Figure 8. Interpretation of the network's decision for a single data point
16. 16
Bidirectional scores give global explanations
Figure 9. Comparison of the R² for inputs of class normal (left) vs plus (right)
17. 17
Summary
Can we relate hand-crafted visual features to DL features?
YES!
‣ RCVs make it possible to measure the relevance of hand-crafted visual
features in the end-to-end classification by the deep network
‣ Sensitivity scores are large and positive for plus cases
‣ Concepts of tortuosity, curvature and dilation are relevant to the
classification