7. Digital Pathology Applications - Cancer Screening
• Goals
– Does this Whole Slide Image (WSI) contain cancer cells?
– Where are the cancer cells located?
[Figure: a 100,000 x 80,000-pixel WSI is tiled into 512 x 512-pixel segments (Segment 1, Segment 2, ...), each annotated as Cancer / Non-Cancer]
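The tiling step above can be sketched as follows; the function name and the edge handling are illustrative, not the authors' code:

```python
def tile_coords(width, height, tile=512):
    """Yield (x, y) upper-left corners of tile x tile patches covering a WSI.
    Edge tiles that would run past the image border are shifted back to fit."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield min(x, width - tile), min(y, height - tile)

# A 100,000 x 80,000-pixel WSI yields 196 x 157 = 30,772 patches of
# 512 x 512 pixels, so a dataset of ~1,000 slides easily exceeds 10M patches.
n_patches = sum(1 for _ in tile_coords(100_000, 80_000))
```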
9. Cancer Screening
Traditional Method: Patch-Based Model
• Pipeline: Annotations → Patching → Training
• A two-level AI model for cancer detection on a Whole Slide Image
• Patch-level model (>10M patches)
– Classes: Background, Benign, Cancer
– Performance: Accuracy 98%, AUC 0.99
[Figure: ground truth shows cancer and normal tissue; the shadowed area is cancer predicted by the AI]
10. Cancer Screening
Traditional Method: Patch-Based Model (cont.)
• Slide-level model (260 training slides, 100 testing slides)
– Decision: Benign or NPC?
– Performance: Accuracy 97%, AUC 0.98
[Figure: ground truth shows cancer and normal tissue; the shadowed area is cancer predicted by the AI]
13. Cancer Screening - Patch-Based Model
Pros:
• Good performance
• Good localization
Cons:
• Two-stage model
• Needs "LOTS OF" segment annotations (~1 hr/slide)
• Confusing annotations
• Low inference speed (15-30 min/slide)
15. Weakly Supervised Learning
• Given only an image-level ground truth
• Where is the target?
[Gupta et al., 2019]
16. Digital Pathology Applications - Cancer Screening
[Figure: four WSIs with image-level labels (Positive, Positive, Negative, Positive); where is the cancer region, if any?]
17. WSOD through Class Activation Map
[Rajpurkar et al., 2018] [Zhou et al., 2016] [Gondal et al., 2017]
18. Digital Pathology Applications - Cancer Screening
[Figure: image-level labels (Positive, Positive, Negative, Positive) train a Residual Network classifier; a Class Activation Map then localizes the predicted cancer region]
[He et al., 2016] [Zhou et al., 2016]
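The CAM computation (Zhou et al., 2016) is just a weighted sum of the last convolutional feature maps, using the classifier weights of the target class. A minimal sketch with plain Python lists; shapes and names are illustrative:

```python
def class_activation_map(features, weights):
    """features: K feature maps from the last conv layer, each an H x W grid.
    weights: the K classifier weights for the target class (the weights that
    follow global average pooling). Returns the H x W activation map;
    upsampled to the input size, it highlights the regions that drove the
    prediction."""
    k_maps = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [[sum(weights[k] * features[k][i][j] for k in range(k_maps))
             for j in range(w)]
            for i in range(h)]

cam = class_activation_map(
    features=[[[1.0, 0.0], [0.0, 0.0]],   # map 0 fires top-left
              [[0.0, 1.0], [0.0, 0.0]]],  # map 1 fires top-right
    weights=[2.0, 3.0])
```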
19. Issue of Out-Of-Memory
Assume a WSI of 60,000 x 60,000 pixels:
• Training on it takes ~320 GB of memory
GPU               Memory   GPU Cores   Memory Bandwidth
Quadro RTX 8000   48 GB    4608        672 GB/s
Tesla V100 v2     32 GB    5120        900 GB/s
Tesla V100        16 GB    5120        900 GB/s
P100              16 GB    3584        732 GB/s
P40               24 GB    3840        346 GB/s
K80               24 GB    4992        480 GB/s
No single GPU can hold such an ultra-high-resolution image.
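A quick back-of-the-envelope check on the figure above (assuming float32 tensors; the ~320 GB training footprint includes intermediate activations, not just the input):

```python
pixels = 60_000 * 60_000              # 3.6 billion pixels
input_gb = pixels * 3 * 4 / 1e9       # RGB, 4 bytes per channel: ~43.2 GB
# The input tensor alone is ~43 GB; forward and backward activations
# multiply this several times over, consistent with the ~320 GB quoted
# above and far beyond the 48 GB of the largest GPU in the table.
```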
20. Issue of Out-Of-Memory (cont.)
No single GPU can hold such an ultra-high-resolution image, so:
• Unified Memory: keep parts of the model/weights in system RAM and load them into GPU RAM only when needed.
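In TensorFlow 1.x (the framework used in this work), one way this oversubscription is exposed is to set the GPU memory fraction above 1.0, which switches allocation to CUDA Unified Memory; the exact fraction below is illustrative:

```python
import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
# A fraction greater than 1.0 asks TensorFlow to allocate through CUDA
# Unified Memory, backing GPU RAM with system RAM
# (e.g. 10x a 32 GB V100 = 320 GB of addressable memory).
config.gpu_options.per_process_gpu_memory_fraction = 10.0
sess = tf.Session(config=config)
```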
21. Issue of Out-Of-Memory (cont.)
• Unified Memory: keep parts of the model/weights in system RAM and load them into GPU RAM only when needed.
This requires a server with:
• Large system RAM and storage
• Fast CPU-GPU transfer
22. How Taiwania 2 Helps Us
Per compute node (nodes connected via 4x InfiniBand, with a login node and 6 TB of storage):
• 8x GPU (Tesla V100 v2, 32 GB)
• 768 GB RAM
• 2 NUMA nodes
Software stack:
• Scheduling: Slurm
• Communication: OpenMPI & Horovod
• Framework: TensorFlow
23. Ultra-Patch Workflow
• Data: 1112 images (positive: 557; negative: 555)
• Hardware: QuantaGrid D52G nodes on Taiwania 2, each with 8x Tesla V100 (32 GB) and 768 GB of system memory
• Parameters: batch size = 1; ~320 GB of system memory used for training through Unified Memory
• Time: training at 0.0067 images/sec (each update takes ~2.5 min); inference at ~20 sec/WSI
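The "each update takes 2.5 mins" figure follows directly from the quoted throughput:

```python
images_per_sec = 0.0067
seconds_per_update = 1 / images_per_sec   # batch size 1: one image per update
minutes_per_update = seconds_per_update / 60
# ~149 s per weight update, i.e. roughly 2.5 minutes, as stated.
```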
24. Ultra-Patch Method - Results
• Compared to the two-stage model, our Ultra-Patch Method with 10k inputs achieves competitive results.
• Comparing input sizes, larger inputs generally yield better results.
28. Optimization for Training Speed
• Unified Memory
• Transfer cost between CPU and GPU
• Group-Execution and Group-Prefetching
[Diagram: in the typical workflow, "Load Tensors" and "Do Compute" alternate serially; with prefetching, the next "Load Tensors" overlaps the current "Do Compute"]
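The prefetching pattern in the diagram can be sketched with a background loader thread that stages the next tensors while the consumer computes; the names and the toy `load_fn` are illustrative:

```python
import queue
import threading

def prefetching_loader(load_fn, items, depth=2):
    """Overlap loading with compute: a worker thread keeps up to `depth`
    loaded items staged while the consumer is busy computing."""
    q = queue.Queue(maxsize=depth)

    def worker():
        for item in items:
            q.put(load_fn(item))  # blocks once `depth` items are staged
        q.put(None)               # sentinel: end of data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            return
        yield batch

# Compute on batch n while batch n+1 is loading in the background.
results = [x * 2 for x in prefetching_loader(lambda i: i + 1, range(5))]
```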
30. Optimization for Training Speed (cont.)
• Parallel training: distributed training through Horovod
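Conceptually, Horovod's data-parallel training has each worker compute gradients on its own shard of slides, then averages them with ring-allreduce before every update. The averaging step, stripped down to plain Python (the real implementation exchanges gradient chunks in a ring over InfiniBand):

```python
def allreduce_mean(worker_grads):
    """Average one gradient vector across workers; every worker then
    applies the same averaged update, keeping model replicas in sync."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

# Four workers, each with gradients from its own batch of WSIs:
avg = allreduce_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
```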
31. Ultra-Patch Workflow (Optimized)
• Data: 1112 images (positive: 557; negative: 555)
• Hardware: QuantaGrid D52G nodes on Taiwania 2, each with 8x Tesla V100 (32 GB) and 768 GB of system memory
• Parameters: batch size = 4 (mixed precision); ~700 GB of system memory used for training through Unified Memory
• Time: training at 1.06 images/sec; inference at ~20 sec/WSI
32. Experiments & Optimization Results
• ~5.11x acceleration from single-GPU optimization
• ~147.28x acceleration on 8 nodes
• ~498x acceleration on 32 nodes
Data-size comparison (1 node vs. 8 nodes):
• More data, better performance.
33. Impact of Ultra-Patch on Medical Image Analysis
• Speeds up development time for EACH cancer classifier/detector
• Reduces laborious annotation work
• Makes results available faster