4. Paper 1: Bilateral-ViT for Robust Fovea
Localization (ISBI22 Best Paper Finalist)
The fovea is a key anatomical location in the retina. Visual
acuity is highest in the fovea region.
5. Challenges of Robust Fovea Localization
• The fovea normally appears as a darker spot; however, its local appearance can be complicated by retinal diseases.
• The shape of the blood vessels provides useful global (long-range) image structure information for fovea localization.
6. Methods
• A transformer-based network that takes both the retina image and a vessel segmentation mask as input for robust fovea localization.
• The overall architecture is a U2-Net structure with customizations.
• We formulate fovea localization as a segmentation problem; the loss function is a Dice loss plus cross-entropy.
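A minimal sketch of such a combined Dice + cross-entropy loss for the binary case (my illustration; the paper's exact weighting and multi-class handling may differ):

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-6):
    """Combined soft Dice + binary cross-entropy loss.

    logits: (B, 1, H, W) raw scores; target: (B, 1, H, W) in {0, 1}.
    """
    # Cross-entropy term (binary case, computed from logits for stability).
    ce = F.binary_cross_entropy_with_logits(logits, target)
    # Soft Dice term computed on predicted probabilities, per sample.
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)
    # Minimize (1 - Dice) plus cross-entropy.
    return ce + (1 - dice).mean()
```

A near-perfect prediction drives both terms toward zero, while the Dice term keeps the loss sensitive to small foreground regions such as the fovea.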
7. Methods – Main Branch and Vessel Branch
• The main branch takes the retina images as input and encodes features using both CNN feature blocks and transformer blocks.
• The vessel branch takes vessel segmentation masks at different scales (sizes) as input and performs feature encoding.
8. Methods – Fusion Branch
• The fusion branch merges the image features and vessel features in a multi-scale manner.
• The fusion-branch and vessel-branch feature blocks are also U-Net-like.
• Hence the overall network is a nested U-Net (U2-Net).
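A hypothetical sketch of multi-scale fusion (my illustration, not the paper's exact design; `FusionBlock` and `fuse_pyramids` are invented names): at each scale, concatenate the image and vessel features and mix them with a 1x1 convolution.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One fusion stage: concatenate image and vessel features at the
    same scale, then mix channels with a 1x1 convolution."""

    def __init__(self, img_ch, vessel_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(img_ch + vessel_ch, out_ch, kernel_size=1)

    def forward(self, img_feat, vessel_feat):
        return self.mix(torch.cat([img_feat, vessel_feat], dim=1))

def fuse_pyramids(img_feats, vessel_feats, blocks):
    """Apply one FusionBlock per scale of the two feature pyramids."""
    return [b(i, v) for b, i, v in zip(blocks, img_feats, vessel_feats)]
```

The actual fusion blocks in the paper are themselves U-Net-like; this snippet only shows the per-scale merge idea.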
9. Experiments
• Performed much better than U2-Net (pure CNN-based) or TransUNet, especially on diseased images.
• Our network = TransUNet + U2-Net + customized vessel fusion = good performance
11. Paper 2 - RTNet: Relation Transformer Network for
Diabetic Retinopathy Multi-lesion Segmentation
(TMI 2022)
Lesion segmentation in retinal images by considering the interactions among
different lesions, and the interactions between lesions and blood vessels.
12. Methods
• The input is a retina image. The outputs are multi-class lesion segmentation masks and a vessel mask.
• The vessel mask is an auxiliary branch used only during training to provide vessel supervisory signals; the pseudo ground-truth vessel masks are produced by a separately trained vessel segmentation model.
• The loss function is simply the standard pixel-wise cross-entropy segmentation loss.
13. Methods – Global Block
• The global block models global spatial attention for each channel, so that small lesions/small structures can be highlighted via global spatial attention.
• Two separate global blocks are used for the vessel features and the lesion features, respectively.
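A rough sketch of per-channel global spatial attention (my illustration, not the paper's exact block): a softmax over all spatial positions of each channel produces a global attention map that re-weights the features, which can amplify small but salient structures.

```python
import torch
import torch.nn as nn

class GlobalSpatialAttention(nn.Module):
    """For each channel, softmax over all spatial positions gives a global
    attention map; features are re-weighted by it, with a residual path."""

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)
        attn = torch.softmax(flat, dim=-1)     # per-channel global attention
        # Rescale by H*W so the mean attention weight is ~1, then re-weight.
        out = flat * attn * (h * w)
        return x + out.view(b, c, h, w)
```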
14. Methods – Relation Block
• Relation blocks are standard transformer blocks modeling long-range spatial interactions.
• The self-attention block models the interactions among different lesions.
• The cross-attention block models the interactions between lesion features and vessel features.
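The two relation blocks can be sketched with standard attention layers (a minimal sketch using PyTorch's `nn.MultiheadAttention`; dimensions and the surrounding projections are my assumptions). Feature maps are flattened to token sequences of shape (B, H*W, C):

```python
import torch
import torch.nn as nn

dim, heads = 64, 4
self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

lesion = torch.randn(2, 16 * 16, dim)   # flattened lesion features
vessel = torch.randn(2, 16 * 16, dim)   # flattened vessel features

# Self-attention: interactions among lesion tokens (lesion-to-lesion).
lesion_sa, _ = self_attn(lesion, lesion, lesion)
# Cross-attention: lesion queries attend to vessel keys/values.
lesion_ca, _ = cross_attn(lesion, vessel, vessel)
```

The cross-attention call is where vessel context flows into the lesion features, which is the interaction the ablation study later isolates.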
15. Experiments – Ablation Study
The most important result: the ablation study proves the relation blocks are
effective!
The cross-attention head is especially important for MA (microaneurysms, i.e., small hemorrhage spots) and SE (soft exudates).
Both MA and SE are difficult-to-segment lesions that may be helped by the
presence of blood vessels/other lesions. Motivation verified.
16. Experiments – Relation Block Attention
Visualization
As we can see, self-attention highlights some other potential lesion
regions, while cross-attention highlights blood vessels.
17. Experiments – SOTA comparisons
• Same-dataset experiments: performance is simply so-so; OK but not impressive.
18. Experiments – Cross Dataset
• Significantly beats the competitors in cross-dataset settings.
19. Conclusion – Part 1
• Long-range interactions, and hence the application of transformers,
are indeed important for many medical tasks, like fovea localization
and lesion segmentation.
• Medical papers: a strong clinical background plus certain experiments
(like cross-dataset settings and attention visualization) can be
impressive.
• Next: long-range interactions have expensive quadratic complexity,
especially in 3D settings -> efficient transformers
21. Paper 1: Self-Supervised Pre-Training of Swin
Transformers for 3D Medical Image Analysis
(NVIDIA, CVPR22)
• SOTA performance on the MSD and
BTCV benchmarks.
• MSD: a comprehensive
benchmark of 10 segmentation
tasks covering both CT and MRI.
• BTCV: an abdomen segmentation
challenge covering 13 organs.
• Code and pre-trained model
available!
22. Why can the performance be so good?
• Applies their customized SSL tricks: masked volume inpainting (cutout
augmentation + reconstruction loss), image rotation (classifying
rotation angles), and contrastive coding.
• Pre-training on 5,050 publicly available CT volumes from various
applications.
• SOTA model architecture (Swin UNETR)
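The masked volume inpainting pretext task can be sketched roughly as follows (my illustration only; the cube size, masking ratio, and reconstruction head follow the paper, not this snippet, and `recon` here is just a stand-in for a model's output):

```python
import torch
import torch.nn.functional as F

def mask_volume(vol, num_cubes=4, cube=8, seed=0):
    """Zero out random sub-cubes of a 3D volume (cutout-style masking).
    The model is then trained to reconstruct the original voxels."""
    g = torch.Generator().manual_seed(seed)
    masked = vol.clone()
    _, _, d, h, w = vol.shape
    for _ in range(num_cubes):
        z = torch.randint(0, d - cube + 1, (1,), generator=g).item()
        y = torch.randint(0, h - cube + 1, (1,), generator=g).item()
        x = torch.randint(0, w - cube + 1, (1,), generator=g).item()
        masked[:, :, z:z+cube, y:y+cube, x:x+cube] = 0.0
    return masked

torch.manual_seed(0)
vol = torch.rand(1, 1, 32, 32, 32)      # a toy CT volume
masked = mask_volume(vol)
recon = masked                          # stand-in for the network's output
# Reconstruction loss: compare the reconstruction to the original volume.
loss = F.l1_loss(recon, vol)
```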
23. Swin UNETR
• Encoder: a series of Swin transformer blocks + downsampling
• Decoder: ResNet blocks + upsampling
• Overall: a U-Net-like architecture
24. Swin Transformer Blocks
• Divide the 3D tokens into sub-windows and compute self-attention only
within each sub-window. That is, we compute local attention rather than
global attention.
• Global attention can still be modeled in deeper network layers.
• To avoid boundary issues between windows, a shifted windowing mechanism is used.
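The window partition and shift can be sketched as follows (a minimal sketch of the mechanism, not the Swin UNETR implementation; masking of rolled-over tokens is omitted):

```python
import torch

def window_partition_3d(x, ws):
    """Split a (B, D, H, W, C) token grid into non-overlapping ws^3 windows.
    Self-attention is then computed only within each window (local attention).
    Returns (num_windows * B, ws**3, C)."""
    b, d, h, w, c = x.shape
    x = x.view(b, d // ws, ws, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, ws ** 3, c)

x = torch.randn(1, 8, 8, 8, 16)                    # toy 3D token grid
windows = window_partition_3d(x, ws=4)             # 2*2*2 = 8 windows
# Shifted windows: roll the grid by ws // 2 so the next block's windows
# straddle the previous block's window boundaries.
shifted = torch.roll(x, shifts=(-2, -2, -2), dims=(1, 2, 3))
shifted_windows = window_partition_3d(shifted, ws=4)
```

Because attention cost is quadratic in the number of tokens, restricting it to ws^3 tokens per window is what makes 3D attention tractable.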
25. Paper 2: CoTr: Efficiently Bridging CNN and
Transformer for 3D Medical Image Segmentation
(MICCAI 21)
• A hybrid CNN-transformer approach.
• We mainly want to know how it uses the deformable transformer (DeTrans)
to efficiently model long-range interactions.
26. Deformable DETR: Deformable Transformers for
End-to-End Object Detection
• Do not perform long-range interaction from the query pixel to all image pixels.
• Instead, sample a smaller number of image positions (via learned sampling
offsets) and compute attention only on those sampled positions.
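The sampling step can be sketched in 2D with bilinear interpolation (my simplified sketch; the real Deformable DETR also predicts per-sample attention weights and uses multiple heads and feature levels):

```python
import torch
import torch.nn.functional as F

def deformable_sample(feat, ref_points, offsets):
    """Sample feat at ref_points + learned offsets via bilinear
    interpolation, so each query attends to only K positions
    instead of all H*W pixels.

    feat:       (B, C, H, W) feature map
    ref_points: (B, Q, 2) query reference points in [-1, 1] grid coords
    offsets:    (B, Q, K, 2) learned offsets, K sampling points per query
    """
    # Each query samples K locations around its reference point.
    locs = ref_points.unsqueeze(2) + offsets                  # (B, Q, K, 2)
    sampled = F.grid_sample(feat, locs, align_corners=False)  # (B, C, Q, K)
    return sampled.permute(0, 2, 3, 1)                        # (B, Q, K, C)

feat = torch.randn(1, 8, 16, 16)
ref = torch.zeros(1, 5, 2)            # 5 queries at the image center
off = torch.rand(1, 5, 4, 2) * 0.1    # 4 small learned offsets per query
out = deformable_sample(feat, ref, off)
```

Attention cost drops from O(Q * H * W) to O(Q * K) with K typically very small, which is what makes this practical for 3D medical volumes.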
27. Conclusion – Part 2
• Some established efficient transformer techniques (Swin and
deformable sampling)
• Swin UNETR: a strong baseline for starting your medical
segmentation projects.
28. Discussion
Why do you think transformers or long-range interactions help in your
machine learning projects?