The Medical Segmentation Decathlon provides a benchmark for evaluating the generalizability of semantic segmentation algorithms across a variety of anatomical structures and imaging modalities. The Decathlon comprises 10 segmentation tasks with over 2,600 unique patient datasets. In Phase 1 of the challenge, participants developed algorithms to segment the structures and submitted results for evaluation; the top-performing methods for each task are identified based on Dice scores and boundary-accuracy metrics. Phase 2 will apply the previously developed algorithms to new datasets without modification, to further evaluate generalizability.
Chap 8. Optimization for training deep models (Young-Geun Choi)
Internal lab seminar material, summarizing and excerpting Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press. It introduces the optimization methods commonly used for the objective functions of deep neural network models during training.
This presentation material provides an introduction to graph grammar and its application to learning a graph generative model. Presented at IBIS 2019, Nagoya, Japan.
Unit 3: Random number generation, random-variate generation (raksharao)
This document discusses random number generation and random variate generation. It covers:
1) Properties of random numbers such as uniformity, independence, maximum density, and maximum period.
2) Techniques for generating pseudo-random numbers such as the linear congruential method and combined linear congruential generators.
3) Tests for random numbers including Kolmogorov-Smirnov, chi-square, and autocorrelation tests.
4) Random variate generation techniques like the inverse transform method, acceptance-rejection technique, and special properties for distributions like normal, lognormal, and Erlang.
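The generator and sampling techniques listed above can be sketched in a few lines of Python. This is a minimal sketch, not a production RNG: the Park-Miller constants are one classic full-period choice, and the exponential distribution stands in for any distribution with an invertible CDF.

```python
# Minimal sketch of a linear congruential generator (LCG) and the
# inverse transform method. The constants are the classic Park-Miller
# "minimal standard" parameters; any full-period choice would do.
import math

def lcg(seed, n, a=16807, c=0, m=2**31 - 1):
    """Generate n pseudo-random uniforms in (0, 1) via X_{i+1} = (a*X_i + c) mod m."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)
    return out

def exponential_variates(uniforms, rate=1.0):
    """Inverse transform: if U ~ Uniform(0,1), then -ln(1-U)/rate ~ Exp(rate)."""
    return [-math.log(1.0 - u) / rate for u in uniforms]

us = lcg(seed=42, n=10_000)
xs = exponential_variates(us, rate=2.0)
```

With 10,000 draws the sample mean of `xs` should be close to 1/rate = 0.5, a quick sanity check of both the generator's uniformity and the transform.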
Adversarial machine learning for AV software (junseok seo)
Introduces practical guidance for developing adversarial machine learning models against anti-malware software. I haven't used a reinforcement learning model yet; this is just a proof of concept. If you have any questions about my work, email me :)
nababora@naver.com
Pixel transforms,
Color transforms,
Histogram processing & equalization,
Filtering,
Convolution,
Fourier transformation and its applications in sharpening,
Blurring and noise removal
This document provides an overview of Naive Bayes classification. It begins with background on classification methods, then covers Bayes' theorem and how it relates to Bayesian and maximum likelihood classification. The document introduces Naive Bayes classification, which makes a strong independence assumption to simplify probability calculations. It discusses algorithms for discrete and continuous features, and addresses common issues like dealing with zero probabilities. The document concludes by outlining some applications of Naive Bayes classification and its advantages of simplicity and effectiveness for many problems.
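As an illustration of the ideas in that summary, here is a minimal discrete Naive Bayes with Laplace (add-one) smoothing, the usual fix for the zero-probability issue. The tiny weather-style dataset is fabricated for the example.

```python
# Sketch of discrete Naive Bayes: class priors plus per-feature
# likelihoods under the conditional-independence assumption, with
# Laplace smoothing so unseen feature values never get probability 0.
import math
from collections import Counter, defaultdict

def train_nb(X, y, alpha=1.0):
    """Return log-priors and a smoothed per-class feature log-likelihood function."""
    classes = Counter(y)
    n = len(y)
    priors = {c: math.log(k / n) for c, k in classes.items()}
    # counts[c][j][v] = how often feature j takes value v in class c
    counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen per feature
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            counts[c][j][v] += 1
            values[j].add(v)
    def loglik(c, j, v):
        return math.log((counts[c][j][v] + alpha) /
                        (classes[c] + alpha * len(values[j])))
    return priors, loglik

def predict(priors, loglik, x):
    # log P(c) + sum_j log P(x_j | c), maximized over classes
    scores = {c: p + sum(loglik(c, j, v) for j, v in enumerate(x))
              for c, p in priors.items()}
    return max(scores, key=scores.get)

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
priors, loglik = train_nb(X, y)
label = predict(priors, loglik, ("rain", "mild"))  # -> "yes"
```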
This presentation introduces anti-aliasing techniques. It begins with defining anti-aliasing as a technique to reduce jagged edges by softening lines and edges. It then shows before and after examples of anti-aliased versus non-anti-aliased polygons. Finally, it discusses the two main anti-aliasing approaches of area sampling and super-sampling and provides details on how each technique works.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
Image processing involves manipulating digital images through algorithms implemented on computers. A digital image is composed of picture elements called pixels arranged in a grid. Each pixel represents a color or intensity value. Common image processing tasks include computer vision, optical character recognition, medical imaging, and more. Key concepts in image processing include pixels, resolution, color depth, and filtering/manipulating pixel values.
Lec8: Medical Image Segmentation (II): Region Growing/Merging (Ulaş Bağcı)
• Region Growing algorithm
• Homogeneity Criteria
• Split/Merge algorithm
• Examples from CT, MRI, PET
• Limitations
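A minimal sketch of the region-growing idea with the simplest homogeneity criterion, an absolute intensity difference from the seed value. Clinical CT/MRI pipelines would use adaptive criteria and 3-D connectivity; treat this as illustrative only.

```python
# Seeded region growing on a 2-D intensity grid with the homogeneity
# criterion |I(p) - I(seed)| <= tol, using 4-connected neighbours.
from collections import deque

def region_grow(img, seed, tol):
    """Breadth-first growth from seed; returns the set of accepted (y, x) pixels."""
    h, w = len(img), len(img[0])
    sy, sx = seed
    base = img[sy][sx]
    region, frontier = {seed}, deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in region
                    and abs(img[ny][nx] - base) <= tol):
                region.add((ny, nx))
                frontier.append((ny, nx))
    return region

img = [[10, 10, 80],
       [10, 12, 80],
       [80, 80, 80]]
region = region_grow(img, seed=(0, 0), tol=5)  # grows over the 4 dark pixels
```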
Image Filtering, Enhancement, Noise Reduction, and Signal Processing • Medical Image Registration • Medical Image Segmentation • Medical Image Visualization • Machine Learning in Medical Imaging • Shape Modeling/Analysis of Medical Images • Deep Learning in Radiology
Histograms show the distribution of pixel intensities in an image by counting the number of pixels for each intensity value. Normalized histograms provide an estimate of the probability of each intensity occurring. Histogram equalization transforms the pixel intensity distribution of an image to a uniform distribution in order to increase contrast. It does this by using the cumulative distribution function to map intensities to new output values. Local histogram equalization performs this on neighborhoods within an image to enhance local details. Arithmetic and logical operations can also be used for image enhancement, such as AND, OR, and subtraction between images on a pixel-by-pixel basis.
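The CDF-based mapping described above fits in a short function. This sketch assumes an 8-bit image flattened to a list of intensities and uses the common `(cdf(v) - cdf_min) / (n - cdf_min)` normalization; a constant image (where `n == cdf_min`) would need a guard.

```python
# Global histogram equalization: count intensities, accumulate the CDF,
# then remap each pixel so the output distribution is roughly uniform.
def equalize(pixels, levels=256):
    n = len(pixels)
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, run = [], 0
    for h in hist:
        run += h
        cdf.append(run)
    cdf_min = next(c for c in cdf if c > 0)  # first non-zero CDF value
    # standard mapping: round((cdf(v) - cdf_min) / (n - cdf_min) * (levels - 1))
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
            for p in pixels]

flat = [52, 52, 60, 60, 60, 180]
eq = equalize(flat)  # low-contrast values spread across the full 0..255 range
```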
The document discusses key concepts in digital image fundamentals including:
1. The electromagnetic spectrum and how light attributes like intensity and luminance are measured.
2. How digital images are acquired through image sensing and sampling/quantization.
3. Methods for representing digital images through matrices and binary values, and how resolution affects gray-level detail.
4. Digital zooming techniques like nearest neighbor, bilinear, and bicubic interpolation that control blurring and edge effects.
5. Concepts like pixel adjacency, connectivity, and distance measures between pixels.
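Point 4's contrast between nearest-neighbor and bilinear zooming can be seen by sampling a 2x2 image at a fractional coordinate: nearest-neighbor snaps to one pixel (blocky edges), while bilinear blends the four neighbors (smoother but slightly blurred). A hedged sketch:

```python
# Sampling one sub-pixel location (y, x) from a small grayscale grid.
def nearest(img, y, x):
    # snap to the closest pixel centre
    return img[round(y)][round(x)]

def bilinear(img, y, x):
    # weighted average of the four surrounding pixels
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy

img = [[0, 100],
       [100, 200]]
n = nearest(img, 0.6, 0.6)   # jumps to a single neighbour's value
b = bilinear(img, 0.6, 0.6)  # smooth blend of all four values
```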
Lab manual of Digital image processing using Python by Khalid Shaikh (khalidsheikh24)
This document is a practical workbook for a digital image processing course. It contains 8 lab sessions where a student learns how to install Python and PyCharm, read and display images, extract image pixel information, convert images between color spaces and formats, apply filters like blurring, and perform operations like edge detection and resizing. Each lab has the objective, task description, source code, and output for tasks related to foundational digital image processing techniques.
Artificial Intelligence, Machine Learning and Deep Learning (Sujit Pal)
Slides for a talk Abhishek Sharma and I gave at the Gennovation tech talks (https://gennovationtalks.com/) at Genesis. The talk was part of outreach for the Deep Learning Enthusiasts meetup group in San Francisco. My part of the talk is covered in slides 19-34.
Lec11: Active Contour and Level Set for Medical Image Segmentation (Ulaş Bağcı)
• Active Contour (Snake)
• Level Set
• Applications
Enhancement, Noise Reduction, and Signal Processing • Medical Image Registration • Medical Image Segmentation • Medical Image Visualization • Machine Learning in Medical Imaging • Shape Modeling/Analysis of Medical Images • Deep Learning in Radiology
Fuzzy Connectivity (FC) – Affinity functions • Absolute FC • Relative FC (and Iterative Relative FC) • Successful example applications of FC in medical imaging • Segmentation of airway and airway walls using an RFC-based method
Energy functional – Data and Smoothness terms • Graph Cut – Min Cut / Max Flow • Applications in Radiology Images
Pattern Recognition is the branch of machine learning and computer science that deals with regularities and patterns in data, which can then be used to classify and categorize the data with the help of a Pattern Recognition System.
“The assignment of a physical object or event to one of several pre-specified categories”-- Duda & Hart
A Pattern Recognition System is responsible for finding patterns and similarities in a given problem/data space, which can then be used to generate solutions to complex problems effectively and efficiently.
Certain problems that can be solved by humans can also be solved by machines using this process.
The document provides an overview of machine learning, including definitions of key concepts. It discusses what machine learning and artificial intelligence are, gives examples of machine learning applications, and describes different types of machine learning systems such as supervised, unsupervised, and reinforcement learning. It also outlines steps for getting started in machine learning, including recommended mathematics, programming, and tool knowledge.
1) The document discusses different types of attention mechanisms in CNNs including self-attention and simplified attention for recalibration.
2) It reviews the evolution of CNN architectures including AlexNet, VGG, ResNet and variants, DenseNet, ResNeXt, Xception, MobileNet and ShuffleNet.
3) These attention mechanisms and CNN architectures are applied to tasks like image recognition, machine translation and image captioning.
This document discusses decision theory and its applications in machine learning. It describes how decision theory uses probability to make optimal decisions given input and target data. It also discusses how to minimize expected error and loss when making predictions. Finally, it explains how inference and decision problems can be broken into two stages and different models like generative, discriminative, and discriminant functions can be used.
Image Segmentation
Types of Image Segmentation
Semantic Segmentation
Instance Segmentation
Types of Image Segmentation Techniques based on the image properties:
Threshold Method.
Edge Based Segmentation.
Region-Based Segmentation.
Clustering Based Segmentation.
Watershed Based Method.
Artificial Neural Network Based Segmentation.
This document discusses support vector machines (SVMs) for classification. It explains that SVMs find the optimal separating hyperplane that maximizes the margin between positive and negative examples. This is formulated as a convex optimization problem. Both primal and dual formulations are presented, with the dual having fewer variables that scale with the number of examples rather than dimensions. Methods for handling non-separable data using soft margins and kernels for nonlinear classification are also summarized. Popular kernel functions like polynomial and Gaussian kernels are mentioned.
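A toy illustration of the primal soft-margin objective: plain subgradient descent on the hinge loss plus an L2 regularizer. This is not the dual/kernel solver the summary describes, just the margin-maximization idea on a linearly separable 2-D set, with the bias folded in as a constant feature.

```python
# Minimize (lam/2)*||w||^2 + mean(max(0, 1 - y_i * (w . x_i)))
# by subgradient descent: each point inside the margin contributes
# -y_i * x_i to the gradient; the regularizer shrinks w toward zero.
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [lam * wi for wi in w]  # regularizer subgradient
        for xi, yi in zip(X, y):
            if yi * sum(wi * xij for wi, xij in zip(w, xi)) < 1:
                for j, xij in enumerate(xi):
                    grad[j] -= yi * xij / len(X)  # hinge-loss subgradient
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# bias folded in as a constant 1.0 first feature
X = [(1.0, 2.0), (1.0, 3.0), (1.0, -2.0), (1.0, -3.0)]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
preds = [1 if sum(wi * xij for wi, xij in zip(w, xi)) >= 0 else -1
         for xi in X]  # should reproduce the labels
```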
The Hough transform is a feature extraction technique used in image analysis and computer vision to detect shapes within images. It works by detecting imperfect instances of objects of a certain class of shapes via a voting procedure. Specifically, the Hough transform can be used to detect lines, circles, and other shapes in an image if their parametric equations are known, and it provides robust detection even under noise and partial occlusion. It works by quantizing the parameter space that describes the shape and counting the number of votes each parametric description receives from edge points in the image.
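The voting procedure can be sketched with the standard (rho, theta) line parameterization rho = x*cos(theta) + y*sin(theta): each edge point votes for every accumulator bin it is consistent with, and a peak in the accumulator recovers a line. The bin resolutions below are arbitrary illustrative choices.

```python
# Hough line detection: quantize (rho, theta), let each point vote,
# then read off the best-voted cell.
import math

def hough_lines(points, n_theta=180, rho_res=1.0, max_rho=100.0):
    n_rho = int(2 * max_rho / rho_res) + 1
    acc = [[0] * n_theta for _ in range(n_rho)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int(round((rho + max_rho) / rho_res))  # shift so rho >= 0
            if 0 <= r < n_rho:
                acc[r][t] += 1
    # the peak cell gives the detected line's (rho, theta) and its votes
    r, t, votes = max(((r, t, acc[r][t])
                       for r in range(n_rho) for t in range(n_theta)),
                      key=lambda cell: cell[2])
    return r * rho_res - max_rho, math.pi * t / n_theta, votes

# ten collinear points on the horizontal line y = 5 (rho = 5, theta = pi/2)
pts = [(x, 5) for x in range(10)]
rho, theta, votes = hough_lines(pts)
```

With coarse 1-degree bins, several adjacent cells can collect all ten votes, so the recovered theta lands near pi/2 rather than exactly on it; finer quantization trades that off against memory and noise tolerance.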
This document discusses computer aided detection (CAD) of abnormalities in medical images. It begins by outlining CAD and some of the key machine learning challenges, including correlated training data, non-standard evaluation metrics, runtime constraints, lack of objective ground truths, and data shortages. It then describes solutions like multiple instance learning, batch classification, cascaded classifiers, crowdsourcing algorithms, and multi-task learning. The document concludes by reviewing the clinical impact of CAD systems through several independent studies, which demonstrated improved radiologist performance and sensitivity in detecting diseases.
Top 10 Data Science Practitioner Pitfalls (Sri Ambati)
Top 10 Data Science Practitioner Pitfalls Meetup with Erin LeDell and Mark Landry on 09.09.15
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
The document discusses using theory-based research to improve health informatics (HI). It provides examples of testing theories from fields like communication, decision-making, and behavior change to optimize eHealth interventions before randomized controlled trials. Specific theories and studies testing things like how alert formatting impacts prescribing are summarized. The document argues this approach can help establish HI as a professional discipline by building a scientific evidence base for more reliable eHealth tools.
Data Science for Business Managers - An intro to ROI for predictive analytics (Akin Osman Kazakci)
This module addresses critical business aspects of launching a predictive analytics project. It discusses how to establish the relationship with business KPIs, introduces the notion of a data hunt for planning and acquiring external data for better predictions, and explains model quality and its role in the ROI of data and prediction tasks. The module concludes with a glimpse of how collaborative data challenges can rapidly improve predictive model quality.
Developing and validating statistical models for clinical prediction and prog... (Evangelos Kritsotakis)
Talk on clinical prediction models presented at the Joint Seminar Series in Translational and Clinical Medicine organised by the University of Crete Medical School, the Institute of Molecular Biology and Biotechnology of the Foundation for Research and Technology Hellas (IMBB-FORTH), and the University of Crete Research Center (UCRC), Heraklion [online], Greece, April 7, 2021.
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med (Sri Ambati)
Machine learning and AI company H2O.ai presented on machine learning applications in modern medicine. They discussed how electronic health records, genomics, wearables, and other data sources can be used with machine learning for personalized healthcare, disease prediction and prevention. H2O's software platform allows building models at scale from large datasets using algorithms like random forests, deep learning and ensembles. Demonstrations showed predicting HIV treatment failure and classifying breast cancer malignancy from medical images, achieving high accuracy. H2O aims to make machine learning accessible and scalable for improving medical research and care.
AI in Healthcare: Real-World Machine Learning Use Cases (Health Catalyst)
Levi Thatcher, PhD, VP of Data Science at Health Catalyst will share practical AI use cases and distill the lessons into a framework you can use when evaluating AI healthcare projects. Specifically, Levi will answer these questions:
What are great healthcare business cases for AI/ML?
What kind of data do you need?
What tools / talent do you need?
How do you integrate AI/ML into the daily workflow?
Batch-13.pptx: lung cancer detection using transfer learning (hananth1513)
Embedded systems
Embedded systems are special-purpose computing systems embedded in application environments or in other computing systems, providing specialized support. The decreasing cost of processing power and memory, combined with the ability to design low-cost systems on chip, has led to the development and deployment of embedded computing systems in a wide range of application environments. Examples include network adapters for computing systems and mobile phones, and control systems for air conditioning, industrial systems, and cars.
Hybrid filtering methods for feature selection in high-dimensional cancer data (IJECEIAES)
Statisticians in both academia and industry have encountered problems with high-dimensional data: the rapid increase in features has caused the feature count to outstrip the instance count. There are several established methods for selecting features from massive amounts of breast cancer data; even so, overfitting continues to be a problem. The challenge of choosing important features with minimal loss across different sample sizes is another area with room for development. As a result, feature selection techniques are crucial for dealing with high-dimensional data classification problems. This paper proposed a new architecture for high-dimensional breast cancer data using filtering techniques and a logistic regression model. Essential features are filtered out using a combination of hybrid chi-square and hybrid information gain (hybrid IG), with logistic regression as the classifier. The results showed that hybrid IG performed best for high-dimensional breast and prostate cancer data. The top 50 and 22 features outperformed the other configurations, with the highest classification accuracies of 86.96% and 82.61%, respectively, after integrating hybrid information gain with a logistic function (hybrid IG+LR) at a sample size of 75. In future work, multiclass classification of multidimensional medical data will be evaluated using data from a different domain.
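The information-gain filtering step this kind of pipeline builds on can be sketched as follows. The tiny dataset is fabricated, and this covers only the IG ranking; the paper's full hybrid pipeline additionally combines chi-square filtering with a logistic-regression classifier.

```python
# Filter-style feature selection: score each discrete feature by the
# entropy reduction (information gain) it gives about the class label,
# then keep the top-k features.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def info_gain(column, labels):
    # H(Y) minus the weighted entropy of Y within each feature-value group
    n = len(labels)
    gain = entropy(labels)
    for v in set(column):
        subset = [l for f, l in zip(column, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def top_k_features(X, y, k):
    d = len(X[0])
    gains = [(info_gain([row[j] for row in X], y), j) for j in range(d)]
    return [j for g, j in sorted(gains, reverse=True)[:k]]

# feature 0 predicts y perfectly, feature 1 is pure noise
X = [(0, 1), (0, 0), (1, 1), (1, 0)]
y = [0, 0, 1, 1]
selected = top_k_features(X, y, k=1)  # -> [0]
```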
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms... (IRJET Journal)
This document describes a study that uses supervised machine learning algorithms to predict breast cancer. Three algorithms - decision tree, logistic regression, and random forest - are applied to preprocessed breast cancer data. The random forest model achieved the best accuracy at 98.6% for predicting whether a tumor was benign or malignant. The study aims to develop an early prediction system for breast cancer using machine learning techniques.
Statistics in the age of data science, issues you cannot ignore (Turi, Inc.)
This document discusses issues in statistics that data scientists can and cannot ignore when working with large datasets. It begins by outlining the talk and defining key terms in data science. It then explains that model assessment, such as estimating model performance on new data, becomes easier with more data as statistical adjustments are not needed. However, more data and variables are not always better, as noise, collinearity, and overfitting can still occur. Several examples are given where common machine learning algorithms can be fooled into achieving high accuracy on training data even when the target variable is random. The conclusion emphasizes that data science, statistics, and domain expertise each provide unique perspectives, and effective teams need to understand all views.
Design of an Intelligent System for Improving Classification of Cancer Diseases (Mohamed Loey)
Methodologies based on gene expression profiles have been able to detect cancer since their inception, and previous works have spent great effort to reach the best results. Some researchers have achieved excellent results in cancer classification based on gene expression profiles, using different gene selection approaches and different classifiers.
Early detection of cancer increases the probability of recovery. This thesis presents an intelligent decision support system (IDSS) for early diagnosis of cancer based on microarray gene expression profiles. The problem with such datasets is the small number of examples (not exceeding hundreds) compared to the large number of genes (in the thousands), so it became necessary to find a method for reducing the features (genes) that are not relevant to the investigated disease, to avoid overfitting. The proposed methodology uses information gain (IG) to select the most important features from the input patterns; the selected features (genes) are then reduced by applying the Grey Wolf Optimization algorithm (GWO), and finally a support vector machine (SVM) classifies the cancer type. The methodology was applied to three datasets (breast, colon, and CNS) and evaluated by classification accuracy, which is most important in the diagnosis of diseases. The best results were obtained when integrating IG with GWO and SVM: classification accuracy improved to 96.67% and the number of features was reduced to 32 for the CNS dataset.
This thesis investigates several classification algorithms and their suitability to the biological domain. For applications that suffer from high dimensionality, different feature selection methods are considered for illustration and analysis, and an effective system is proposed. Experiments were conducted on three benchmark gene expression datasets, and the proposed system is assessed and compared with the performance of related work.
1) The document discusses different metrics for optimizing predictive models, noting that squared error can emphasize outliers while lift charts are better. It recommends not optimizing AUC alone.
2) Global search algorithms may be needed if the model and error metric are not simple. The goal of the project and what to optimize should be considered.
3) Case studies are presented showing how optimizing for the problem goal, like flagging account outliers for fraud detection, led to better outcomes than a general classification approach.
1) The document discusses how Cancer Research UK and its Drug Discovery Unit are using AI/ML techniques like deep learning for drug discovery applications such as identifying potential small molecules that can selectively induce senescence without cell death.
2) As a case study, they trained a neural network classification model on a dataset of compounds labeled as senescence inducers or not to predict new senescence inducing compounds.
3) The best model was an ensemble of 10 neural networks that was used to screen a virtual library of 2 million compounds, filtering the results to propose 147 compounds for further validation in cell-based screens.
8. Origins
I have loads of data! Would be delighted to share it if that helps find such a solution.
9. • Semantic segmentation algorithms are increasingly general purpose
o Models are translatable to unseen tasks.
• Algorithmic advances are commonly validated on one or two tasks
o This limits our understanding of the generalisability of the proposed contributions.
• Imaging-based care has many “tasks”
o A model that “just works” would have a tremendous impact on healthcare.
Challenges – Algorithm Generalisability
10. • The field is missing a benchmark for general purpose algorithmic validation
• The approach should be fully open source and comprehensive
• The benchmark should test a wide span of challenges
o Big vs Small data
o Balanced vs Unbalanced labels
o Small vs Large Objects
o Single vs Multi-class labels
o Mono- vs Multi-modal Imaging
Challenges – Open Benchmark
11. • Low Cardinality
o Finding data is hard
o Getting ethics/governance approval is harder
• Constrained License
o Limits publishing rights
o Restricts data reuse
o Can't be used for commercial applications
• Low coverage of anatomical appearances
Challenges – Open Data
12. • Metrics/Statistics
o Maier-Hein et al. “Is the winner really the best?”
o Tried and tested metrics
o Pre-submission Published Ranking
• Open Implementation
o COMIC
o Open Metrics code
• Continuous submission system
Challenges – Best Practices
14. • Registration free data download
• Permissive copyright license (CC-BY-SA 4.0)
o This allows for data to be shared, distributed and improved upon.
o Only constraints:
- Attribution
- Share-Alike
• All data has been labelled and verified by an expert human rater
o Best effort to mimic the accuracy required for clinical use.
Open Data
15. The aim is to develop an algorithm or learning system that can solve each task,
separately, without human interaction. This can be achieved through the use of a
single learner, an ensemble of multiple learners, architecture search, curriculum
learning, or any other technique, as long as task-specific model parameters are
not human-defined.
Problem Statement
16. Phase 1
1. Data for 7 tasks was released on the 11th of May.
2. Develop Algorithm
3. Train/Test without human interaction
4. Submit the segmentation results by the 5th of August.
Phase 2
1. Release 3 more tasks on the 6th of August.
2. Train the previously developed algorithm, without any software modifications
3. Submit results of the last 3 tasks by the 31st of August.
Process - Phase 1
18. • Phase 1:
o Best Method
• Phase 2:
o Best Method
o Runner Up x2
• NVIDIA Titan V ($2,999)
o 110 teraflops of compute power
• Sponsored by NVIDIA
Award
19. 15:00 - 15:10: Introduction
15:10 - 15:20: Data Description
15:20 - 15:35: Metrics and Statistics Methodology
15:35 - 15:50: NVIDIA
15:50 - 16:00: Phase 1 Results
16:00 - 16:25: Presentation: Top 5 methods (Phase 1)
16:25 - 16:35: Phase 2 Results
16:35 - 16:55: Panel discussion
16:55 - 17:00: Conclusions and closing
Plan for today
22. Target: Liver and tumour
Modality: Portal venous phase CT
Size: 201 3D volumes (131 Training + 70 Testing)
Source: IRCAD Hôpitaux Universitaires
Challenge: Label imbalance with a large (liver) and a small (tumour) target
Data
Liver Tumors
23. Target: Glioma segmentation: necrotic/active tumour and oedema
Modality: Multimodal multisite MRI data (FLAIR, T1w, T1gd, T2w)
Size: 750 4D volumes (484 Training + 266 Testing)
Source: BRATS 2016 and 2017 datasets
Challenge: Complex and heterogeneously-located targets
Data
Brain Tumors
24. Target: Hippocampus head and body
Modality: Mono-modal MRI
Size: 394 3D volumes (263 Training + 131 Testing)
Source: Vanderbilt University Medical Center
Challenge: Segmenting two neighbouring small structures
with high precision
Data
Hippocampus
25. Target: Lung and tumours
Modality: CT
Size: 96 3D volumes (64 Training + 32 Testing)
Source: The Cancer Imaging Archive
Challenge: Segmentation of a small target (cancer) in a large
image
Data
Lung Tumors
26. Target: Prostate central gland and peripheral zone
Modality: Multimodal MR (T2, ADC)
Size: 48 4D volumes (32 Training + 16 Testing)
Source: Radboud University, Nijmegen Medical Centre
Challenge: Segmenting two adjacent regions with large inter-subject variations
Data
Prostate
27. Target: Left Atrium
Modality: Mono-modal MRI
Size: 30 3D volumes (20 Training + 10 Testing)
Source: King’s College London
Challenge: Small training dataset with large variability
Data
Cardiac
28. Target: Pancreas and tumour/cystic lesion
Modality: Portal venous phase CT
Size: 420 3D volumes (282 Training +139 Testing)
Source: Memorial Sloan Kettering Cancer Center
Challenge: Label imbalance with large (background), medium (pancreas) and small (tumour) structures
Data
Pancreas Tumor/Cystic Lesions
30. Target: Colon tumor
Modality: CT
Size: 190 3D volumes (126 Training + 64 Testing)
Source: Memorial Sloan Kettering Cancer Center
Challenge: Heterogeneous appearance, very hard to annotate
Data
Colon Tumor
31. Target: Hepatic vessels and tumour
Modality: CT
Size: 443 3D volumes (303 Training + 140 Testing)
Source: Memorial Sloan Kettering Cancer Center
Challenge: Tubular small structures next to heterogeneous
tumour
Data
Hepatic Vessels
33. Target: Spleen
Modality: CT
Size: 61 3D volumes (41 Training + 20 Testing)
Source: Memorial Sloan Kettering Cancer Center
Challenge: Large-ranging foreground size
Data
Spleen
36. • Do not touch the challenge data! Come up with a ranking scheme before the challenge.
• Experiments were performed with 56 segmentation competitions for which per-case data was available
• You can read a lot of the details in [1]
Strategy for generation of ranking scheme
[1] Maier-Hein et al.: Is the winner really the best? A critical analysis of common research practice in biomedical image analysis competitions, arXiv:1806.02051 (2018)
44. • Bootstrapping with case resampling
o Repeat 1000 times:
- Resample all metric values with replacement. The size of the resample must be equal to the number of test cases.
- Compute a new ranking based on the modified set of results
- Determine the challenge winner
o Compute the % of cases where the winner stayed the winner
o Compute the % of algorithms that were ranked first in at least 1% of the simulations.
How to measure ranking stability?
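The bootstrap above can be sketched in a few lines of NumPy. The per-case Dice scores below are invented illustrative numbers, and `winner` assumes the mean-aggregation ("aggregate then rank") scheme discussed on the following slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-case Dice scores: rows = algorithms, columns = test cases.
scores = np.array([
    [0.91, 0.88, 0.95, 0.84, 0.90],   # algorithm A
    [0.89, 0.90, 0.93, 0.86, 0.88],   # algorithm B
    [0.80, 0.82, 0.85, 0.79, 0.81],   # algorithm C
])

def winner(per_case):
    """Aggregate-then-rank: the highest mean score wins."""
    return int(np.argmax(per_case.mean(axis=1)))

original_winner = winner(scores)

# Bootstrap with case resampling: draw test cases with replacement
# (resample size = number of test cases), recompute the ranking each time.
n_boot = 1000
n_cases = scores.shape[1]
wins = np.zeros(scores.shape[0], dtype=int)
for _ in range(n_boot):
    idx = rng.integers(0, n_cases, size=n_cases)  # resampled case indices
    wins[winner(scores[:, idx])] += 1

stability = wins[original_winner] / n_boot            # % where winner held
frequent_winners = int((wins / n_boot >= 0.01).sum())  # first in ≥1% of runs
print(f"winner stayed winner in {stability:.0%} of resamples")
print(f"{frequent_winners} algorithm(s) ranked first in at least 1% of simulations")
```

With scores as close as A's and B's here, the winner typically flips in a substantial fraction of resamples, which is exactly the instability the procedure is designed to expose.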
45. Metric-based aggregation with mean most robust
Maier-Hein et al.: Is the winner really the best? A critical analysis of common research practice in biomedical image analysis competitions, arXiv:1806.02051 (2018)
Conclusion:
Metric-based aggregation (aggregate then rank) with the mean is the most robust ranking scheme among those commonly applied
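A toy example of the two aggregation orders, using invented per-case scores chosen so that "aggregate then rank" and "rank then aggregate" disagree:

```python
import numpy as np

# Hypothetical per-case metric values (rows = algorithms, cols = test cases).
scores = np.array([
    [0.95, 0.40, 0.92],   # algorithm A: strong, but one bad outlier case
    [0.85, 0.80, 0.84],   # algorithm B: consistently decent
])

# Aggregate then rank: average each algorithm's metric first, then rank.
means = scores.mean(axis=1)                    # A ~ 0.757, B = 0.83
winner_aggregate_first = int(np.argmax(means)) # B wins

# Rank then aggregate: rank algorithms per case (1 = best), then average ranks.
per_case_ranks = np.argsort(np.argsort(-scores, axis=0), axis=0) + 1
mean_ranks = per_case_ranks.mean(axis=1)       # A ~ 1.33, B ~ 1.67
winner_rank_first = int(np.argmin(mean_ranks)) # A wins

print(winner_aggregate_first, winner_rank_first)  # the two schemes disagree
```

Rank-then-aggregate hides the size of A's failure on the outlier case; aggregating the metric first keeps it visible, which is one intuition behind preferring aggregate-then-rank with the mean.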
46. Done?
• Conclusion? Take metric-based aggregation with mean as the standard method
• Some problems:
o Missing-value handling is statistically suboptimal when using the mean (because it depends on the “punishing value”)
o An arbitrarily small difference in aggregated metric values results in different ranks
o Statistical tests are not straightforward to apply, because pairwise comparisons may result in “inconsistent” rankings
47. Significance Ranking (Proposed)
1. Performance assessment per case: compute each metric m_j (j = 1, ..., N_m) for each algorithm a_l (l = 1, ..., N_A) on each test case t_ik.
2. Pairwise comparisons of algorithms: Wilcoxon signed-rank test on the paired differences m_j(a_l, t_ik) - m_j(a_l', t_ik).
3. Significance scoring: s_ij(a_l) = number of algorithms performing significantly worse than a_l according to m_j.
4. Significance ranking: rank algorithms according to s_ij(a_x), x = 1, ..., N_A => highest score: rank 1.
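The four steps can be sketched for a single metric using SciPy's Wilcoxon signed-rank test. The per-case scores below are synthetic, and the double loop is a simplified reading of the scheme (one metric, no tie handling):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Step 1: hypothetical per-case values of one metric m_j
# (rows = algorithms a_1..a_3, columns = 20 shared test cases t_ik).
scores = np.vstack([
    rng.normal(0.90, 0.03, 20),   # a_1: clearly best
    rng.normal(0.80, 0.03, 20),   # a_2: middle
    rng.normal(0.65, 0.03, 20),   # a_3: clearly worst
]).clip(0.0, 1.0)

n_alg, alpha = scores.shape[0], 0.05

# Steps 2-3: significance score s(a_l) = number of algorithms that a_l beats
# in a one-sided Wilcoxon signed-rank test on the paired per-case differences.
sig_score = np.zeros(n_alg, dtype=int)
for l in range(n_alg):
    for other in range(n_alg):
        if l != other:
            _, p = wilcoxon(scores[l], scores[other], alternative="greater")
            sig_score[l] += int(p < alpha)

# Step 4: rank by significance score, highest score first (rank 1).
rank_order = np.argsort(-sig_score)   # algorithm indices, best first
print(sig_score, rank_order)
```

With three well-separated algorithms the scores come out as 2, 1, 0, giving an unambiguous ranking; ties in `sig_score` would produce the shared places reported on the next slide.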
48. • Robustness similar to that of metric-based aggregation
• Significance level of the statistical test has minor influence (default 0.05)
• More shared places compared to metric-based aggregation
o 1 winner: 73% / 100%
o 2 winners: 14% / 0%
o 3 winners: 11% / 0%
o 4 winners: 2% / 0%
Experiments with significance ranking
49. Thank you!
• News on the challenge initiative will be tweeted:
• @cami_dkfz #BiomedicalChallenges
Further reading: Maier-Hein et al.: Is the winner really the best? A critical analysis of common research practice in biomedical image analysis competitions, arXiv:1806.02051 (2018)
51. Other metrics considered
• Robust Dice (ignoring border region)
• Volume Difference (RMSE)
• Voxel classification statistics
o True positives
o False positives
o True negatives
o False negatives
Volume-based Performance Metrics
Dice Sørensen Coefficient (between two masks M1 and M2):
DSC = 2 |M1 ∩ M2| / (|M1| + |M2|)
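As a quick reference, the Dice Sørensen coefficient on binary masks can be computed directly with NumPy; the two masks below are toy examples:

```python
import numpy as np

def dice(m1: np.ndarray, m2: np.ndarray) -> float:
    """Dice Sørensen coefficient between two binary masks."""
    m1, m2 = m1.astype(bool), m2.astype(bool)
    denom = m1.sum() + m2.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * np.logical_and(m1, m2).sum() / denom

a = np.zeros((4, 4), dtype=bool); a[1:3, 1:3] = True   # 4 foreground voxels
b = np.zeros((4, 4), dtype=bool); b[1:3, 1:4] = True   # 6 voxels, 4 shared
print(dice(a, b))  # 2*4 / (4+6) = 0.8
```

The empty-mask convention (returning 1.0) is a choice made for this sketch; challenge implementations may define that case differently.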
52. Surface-based Performance Metrics
(Figure: gold standard vs model prediction surfaces)
Standard surface metrics:
• Maximum surface distance (a.k.a. Hausdorff)
• 95th percentile of surface distances (Hausdorff95)
• Mean surface distance
• Median surface distance
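These standard surface metrics can be approximated on point sets sampled from the two surfaces. This is a simplified sketch using pairwise distances (SciPy's `cdist`), not the challenge's exact voxel-based implementation, and the square contours are invented test data:

```python
import numpy as np
from scipy.spatial.distance import cdist

def surface_distance_metrics(surf_a, surf_b):
    """Symmetric Hausdorff, Hausdorff95, mean and median surface distances
    between two surfaces approximated as point sets."""
    d = cdist(surf_a, surf_b)      # all pairwise Euclidean distances
    a_to_b = d.min(axis=1)         # nearest-neighbour distance per point of A
    b_to_a = d.min(axis=0)         # and per point of B (symmetric measure)
    all_d = np.concatenate([a_to_b, b_to_a])
    return {
        "hausdorff": all_d.max(),
        "hausdorff95": np.percentile(all_d, 95),
        "mean": all_d.mean(),
        "median": np.median(all_d),
    }

# Two hypothetical contours: a unit square and a copy shifted by 0.05 in x.
t = np.linspace(0, 1, 50)
square = np.concatenate([np.stack([t, np.zeros_like(t)], 1),
                         np.stack([t, np.ones_like(t)], 1),
                         np.stack([np.zeros_like(t), t], 1),
                         np.stack([np.ones_like(t), t], 1)])
shifted = square + [0.05, 0.0]
m = surface_distance_metrics(square, shifted)
print(m["hausdorff"], m["hausdorff95"], m["mean"])
```

The maximum distance is driven entirely by the vertical edges (0.05 apart), while the mean is pulled down by the nearly overlapping horizontal edges, illustrating why the slide lists several complementary surface statistics.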
53. Surface DSC*
(Figure: gold standard vs model prediction surfaces with acceptable deviation τ)
• Acceptable deviation τ defined by the data set provider
• Green: acceptable surface parts (distance between surfaces ≤ τ)
• Pink: unacceptable parts of the surfaces (distance between surfaces > τ)
• Surface DSC measures the fraction of acceptable surface parts
*called Normalised Surface Distance (NSD) in the challenge
54. Surface DSC Illustration
• Blue: ground truth; green: predicted
• Yellow: distances ≤ τ; red: distances > τ (with τ = 0.2)
• For illustration, only distances from the predicted contour to the ground-truth contour are shown
Example panels:
o Hausdorff: 0.14, Hausdorff95: 0.13, Surface DSC: 100%
o Hausdorff: 1.38, Hausdorff95: 0.65, Surface DSC: 88%
o Hausdorff: 0.61, Hausdorff95: 0.58, Surface DSC: 60%
55. Surface DSC Computation
• Ratio of the “overlapping” surface area (parts within tolerance) to the total surface area
• 0.0 = no overlap, 1.0 = perfect overlap
• Pipeline: raw mask → surface → border region at tolerance τ → forward and backward “overlap”
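The computation can be sketched on point-set approximations of the two surfaces. This illustrative version treats every sampled point as carrying equal surface area, whereas a voxel-accurate implementation would weight by true surface area; the circle contours are invented test data:

```python
import numpy as np
from scipy.spatial.distance import cdist

def surface_dsc(surf_pred, surf_gt, tau):
    """Surface DSC (a.k.a. Normalised Surface Distance): fraction of both
    surfaces lying within tolerance tau of the other surface."""
    d = cdist(surf_pred, surf_gt)
    pred_ok = (d.min(axis=1) <= tau).sum()  # "overlapping" predicted surface
    gt_ok = (d.min(axis=0) <= tau).sum()    # "overlapping" gold-standard surface
    return (pred_ok + gt_ok) / (len(surf_pred) + len(surf_gt))

# Ground truth: unit circle; prediction: the same circle scaled 10% too large,
# so every point lies exactly 0.1 away from the other surface.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
pred = 1.1 * circle
print(surface_dsc(pred, circle, tau=0.2))   # 1.0: everywhere within tolerance
print(surface_dsc(pred, circle, tau=0.05))  # 0.0: nowhere within tolerance
```

The example makes the role of τ concrete: the same prediction scores perfectly under a generous tolerance and zero under a strict one, so the data-set provider's choice of τ directly defines what counts as clinically acceptable.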
84. Panel Discussion
• Even more varied data?
• Complex vs simple ranking/metrics?
• Freedom vs Dockerization?
• Fairness in Compute Power?
85. Closing Remarks
• Article publication
o Data
o Full Paper
• Re-open submission system
o Monthly ranking release
o Effort to avoid overfitting
o Standard Benchmark for Segmentation
• The future
o We aim to achieve the dodecathlon 2019 (20 tasks)
o RSNA/ACR data validation