7. Digital Pathology Applications - Cancer Screening
• Goals
– Does this Whole Slide Image (WSI) contain cancer cells?
– Where are the cancer cells located?
[Figure: a 100,000 x 80,000-pixel WSI is tiled into 512 x 512-pixel segments (Segment 1, Segment 2, ...), each annotated as Cancer / Non-Cancer]
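The tiling step above can be sketched as follows; the function name and the edge handling are illustrative, not the authors' code:

```python
def tile_coords(width, height, tile=512):
    """Yield (x, y) upper-left corners of tile x tile patches covering a WSI.
    Edge tiles that would run past the image border are shifted back to fit."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield min(x, width - tile), min(y, height - tile)

# A 100,000 x 80,000-pixel WSI yields 196 x 157 = 30,772 patches of
# 512 x 512 pixels, so a dataset of ~1,000 slides easily exceeds 10M patches.
n_patches = sum(1 for _ in tile_coords(100_000, 80_000))
```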
9. Cancer Screening
Traditional Method: Patch-Based Model
• Pipeline: Annotations → Patching → Training
• A two-level AI model for cancer detection on a Whole Slide Image
• Patch-level model (>10M patches)
– Classes: Background, Benign, Cancer
– Performance: Accuracy 98%, AUC 0.99
[Figure: ground truth shows cancer and normal tissue; the shadowed area is cancer predicted by the AI]
10. Cancer Screening
Traditional Method: Patch-Based Model (cont.)
• Slide-level model (260 training slides, 100 testing slides)
– Decision: Benign or NPC?
– Performance: Accuracy 97%, AUC 0.98
[Figure: ground truth shows cancer and normal tissue; the shadowed area is cancer predicted by the AI]
13. Cancer Screening - Patch-Based Model
Pros:
• Good performance
• Good localization
Cons:
• Two-stage model
• Needs "LOTS OF" segment annotations (~1 hr/slide)
• Confusing annotations
• Low inference speed (15-30 min/slide)
15. Weakly Supervised Learning
• Given only an image-level ground truth
• Where is the target?
[Gupta et al., 2019]
16. Digital Pathology Applications - Cancer Screening
[Figure: four WSIs with image-level labels (Positive, Positive, Negative, Positive); where is the cancer region, if any?]
17. WSOD through Class Activation Map
[Rajpurkar et al., 2018] [Zhou et al., 2016] [Gondal et al., 2017]
18. Digital Pathology Applications - Cancer Screening
[Figure: image-level labels (Positive, Positive, Negative, Positive) train a Residual Network classifier; a Class Activation Map then localizes the predicted cancer region]
[He et al., 2016] [Zhou et al., 2016]
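The CAM computation (Zhou et al., 2016) is just a weighted sum of the last convolutional feature maps, using the classifier weights of the target class. A minimal sketch with plain Python lists; shapes and names are illustrative:

```python
def class_activation_map(features, weights):
    """features: K feature maps from the last conv layer, each an H x W grid.
    weights: the K classifier weights for the target class (the weights that
    follow global average pooling). Returns the H x W activation map;
    upsampled to the input size, it highlights the regions that drove the
    prediction."""
    k_maps = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [[sum(weights[k] * features[k][i][j] for k in range(k_maps))
             for j in range(w)]
            for i in range(h)]

cam = class_activation_map(
    features=[[[1.0, 0.0], [0.0, 0.0]],   # map 0 fires top-left
              [[0.0, 1.0], [0.0, 0.0]]],  # map 1 fires top-right
    weights=[2.0, 3.0])
```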
19. Issue of Out-Of-Memory
Assume a WSI of 60,000 x 60,000 pixels:
• Training on it takes ~320 GB of memory
GPU               Memory   GPU Cores   Memory Bandwidth
Quadro RTX 8000   48 GB    4608        672 GB/s
Tesla V100 v2     32 GB    5120        900 GB/s
Tesla V100        16 GB    5120        900 GB/s
P100              16 GB    3584        732 GB/s
P40               24 GB    3840        346 GB/s
K80               24 GB    4992        480 GB/s
No single GPU can hold such an ultra-high-resolution image.
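A quick back-of-the-envelope check on the figure above (assuming float32 tensors; the ~320 GB training footprint includes intermediate activations, not just the input):

```python
pixels = 60_000 * 60_000              # 3.6 billion pixels
input_gb = pixels * 3 * 4 / 1e9       # RGB, 4 bytes per channel: ~43.2 GB
# The input tensor alone is ~43 GB; forward and backward activations
# multiply this several times over, consistent with the ~320 GB quoted
# above and far beyond the 48 GB of the largest GPU in the table.
```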
20. Issue of Out-Of-Memory (cont.)
No single GPU can hold such an ultra-high-resolution image, so:
• Unified Memory: keep parts of the model/weights in system RAM and load them into GPU RAM only when needed.
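In TensorFlow 1.x (the framework used in this work), one way this oversubscription is exposed is to set the GPU memory fraction above 1.0, which switches allocation to CUDA Unified Memory; the exact fraction below is illustrative:

```python
import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
# A fraction greater than 1.0 asks TensorFlow to allocate through CUDA
# Unified Memory, backing GPU RAM with system RAM
# (e.g. 10x a 32 GB V100 = 320 GB of addressable memory).
config.gpu_options.per_process_gpu_memory_fraction = 10.0
sess = tf.Session(config=config)
```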
21. Issue of Out-Of-Memory (cont.)
• Unified Memory: keep parts of the model/weights in system RAM and load them into GPU RAM only when needed.
This requires a server with:
• Large system RAM and storage
• Fast CPU-GPU transfer
22. How Taiwania 2 Helps Us
Per compute node (nodes connected via 4x InfiniBand, with a login node and 6 TB of storage):
• 8x GPU (Tesla V100 v2, 32 GB)
• 768 GB RAM
• 2 NUMA nodes
Software stack:
• Scheduling: Slurm
• Communication: OpenMPI & Horovod
• Framework: TensorFlow
23. Ultra-Patch Workflow
• Data: 1112 images (positive: 557; negative: 555)
• Hardware: QuantaGrid D52G nodes on Taiwania 2, each with 8x Tesla V100 (32 GB) and 768 GB of system memory
• Parameters: batch size = 1; ~320 GB of system memory used for training through Unified Memory
• Time: training at 0.0067 images/sec (each update takes ~2.5 min); inference at ~20 sec/WSI
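The "each update takes 2.5 mins" figure follows directly from the quoted throughput:

```python
images_per_sec = 0.0067
seconds_per_update = 1 / images_per_sec   # batch size 1: one image per update
minutes_per_update = seconds_per_update / 60
# ~149 s per weight update, i.e. roughly 2.5 minutes, as stated.
```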
24. Ultra-Patch Method - Results
• Compared to the two-stage model, our Ultra-Patch Method with 10k inputs achieves competitive results.
• Comparing input sizes, larger inputs generally yield better results.
28. Optimization for Training Speed
• Unified Memory
• Transfer cost between CPU and GPU
• Group-Execution and Group-Prefetching
[Diagram: in the typical workflow, "Load Tensors" and "Do Compute" alternate serially; with prefetching, the next "Load Tensors" overlaps the current "Do Compute"]
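The prefetching pattern in the diagram can be sketched with a background loader thread that stages the next tensors while the consumer computes; the names and the toy `load_fn` are illustrative:

```python
import queue
import threading

def prefetching_loader(load_fn, items, depth=2):
    """Overlap loading with compute: a worker thread keeps up to `depth`
    loaded items staged while the consumer is busy computing."""
    q = queue.Queue(maxsize=depth)

    def worker():
        for item in items:
            q.put(load_fn(item))  # blocks once `depth` items are staged
        q.put(None)               # sentinel: end of data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            return
        yield batch

# Compute on batch n while batch n+1 is loading in the background.
results = [x * 2 for x in prefetching_loader(lambda i: i + 1, range(5))]
```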
30. Optimization for Training Speed (cont.)
• Parallel training: distributed training through Horovod
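Conceptually, Horovod's data-parallel training has each worker compute gradients on its own shard of slides, then averages them with ring-allreduce before every update. The averaging step, stripped down to plain Python (the real implementation exchanges gradient chunks in a ring over InfiniBand):

```python
def allreduce_mean(worker_grads):
    """Average one gradient vector across workers; every worker then
    applies the same averaged update, keeping model replicas in sync."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

# Four workers, each with gradients from its own batch of WSIs:
avg = allreduce_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
```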
31. Ultra-Patch Workflow (Optimized)
• Data: 1112 images (positive: 557; negative: 555)
• Hardware: QuantaGrid D52G nodes on Taiwania 2, each with 8x Tesla V100 (32 GB) and 768 GB of system memory
• Parameters: batch size = 4 (mixed precision); ~700 GB of system memory used for training through Unified Memory
• Time: training at 1.06 images/sec; inference at ~20 sec/WSI
32. Experiments & Optimization Results
• ~5.11x acceleration from single-GPU optimization
• ~147.28x acceleration on 8 nodes
• ~498x acceleration on 32 nodes
Data-size comparison (1 node vs. 8 nodes):
• More data, better performance.
33. Impact of Ultra-Patch on Medical Image Analysis
• Speeds up development time for EACH cancer classifier/detector
• Reduces laborious annotation work
• Makes results available faster