Dynamic Graph Enhanced
Contrastive Learning for Chest
X-ray Report Generation
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/04/22
Mingjie Li et al.
CVPR 2023
2
Introduction
● Automatic radiology reporting aims to relieve radiologists from heavy workloads and improve diagnostic
interpretation
● Researchers have enhanced data-driven neural networks with medical knowledge graphs to
mitigate the severe visual and textual bias
● Fixed graphs cannot guarantee the most appropriate scope of knowledge, which limits their
effectiveness
● The paper proposes a knowledge graph with dynamic structure and nodes to facilitate chest X-ray report
generation with Contrastive Learning (DCL)
3
Introduction
Fig. 1. An illustration of one sample from MIMIC-CXR and the pre-constructed graph, where the blue circle, orange boxes and black boxes refer to the global node, organ-level entities and key findings, respectively. The red dashed line represents the unconnected relation
4
Related Work
Medical Report Generation Meets Knowledge Graph
● [20] collected abnormalities from the MIMIC-CXR dataset and used them as the nodes of an abnormality
graph; the edges are the attention weights from source nodes to target nodes in the graph
● [26, 48] adopt a graph as prior medical knowledge to enhance the generation procedure
● [32] does the same, and the graph further facilitates an unsupervised learning framework
● [46, 30] adopted a universal graph with 20 entities in which entities linked to the same organ are connected
to each other; the relationships are modelled as an adjacency matrix and used to propagate messages in a
graph encoding module. The graph has a fixed structure => it cannot guarantee the appropriate scope of
knowledge for some cases (common entities may be missing or unconnected)
● [30] utilized the global representations of reports pre-retrieved from the training corpus to model
domain-specific knowledge
● [24] constructed a clinical graph by extracting structural information from the training corpus via an NLP-
rule-based algorithm, storing a set of triplets for each case to model the domain-specific knowledge and
replace the visual representations
[20] Christy Y Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6666–6673,
2019
[26] Mingjie Li, Rui Liu, Fuyu Wang, Xiaojun Chang, and Xiaodan Liang. Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web, pages 1–18, 2022
[48] Haifeng Zhao, Jie Chen, Lili Huang, Tingting Yang, Wanhai Ding, and Chuanfu Li. Automatic generation of medical report with knowledge graph. In Proceedings of the International Conference on Computing and Pattern Recognition, pages
1–1, 2021
[32] Fenglin Liu, Chenyu You, Xian Wu, Shen Ge, Xu Sun, et al. Auto-encoding knowledge graph for unsupervised medical report generation. In Proceedings of the Conference on Neural Information Processing Systems, pages 16266–16279,
2021
[46] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12910–
12917, 2020
[30] Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
13753–13762, 2021
[24] Mingjie Li, Wenjia Cai, Karin Verspoor, Shirui Pan, Xiaodan Liang, and Xiaojun Chang. Cross-modal clinical graph transformer for ophthalmic report generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 20656–20665, 2022
5
Related Work
Contrastive Learning (improves representation learning by contrasting positive/negative or similar/dissimilar pairs)
● [43] developed a weakly supervised method that contrasts target reports with incorrect ones by identifying
“hard” negative samples
● [31] compared the given image with known normal images via a contrastive attention mechanism to
detect abnormal regions
● [5, 42, 45] employed contrastive learning during pre-training to better represent visual
features and textual information
[43] An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili, Julian McAuley, and Chun-nan Hsu. Weakly supervised contrastive learning for chest x-ray report generation. In Findings of the Association for Computational Linguistics:
EMNLP 2021, pages 4009–4015, 2021
[31] Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest x-ray report generation. In Findings of the Association for Computational Linguistics, pages 269–280, 2021
[5] Yu-Jen Chen, Wei-Hsiang Shen, Hao-Wei Chung, Jing-Hao Chiu, Da-Cheng Juan, Tsung-Ying Ho, Chi-Tung Cheng, Meng-Lin Li, and Tsung-Yi Ho. Representative image feature extraction via contrastive learning pre-training for chest x-ray
report generation. arXiv preprint arXiv:2209.01604, 2022
[42] Xing Wu, Jingwen Li, Jianjia Wang, and Quan Qian. Multimodal contrastive learning for radiology report generation. Journal of Ambient Intelligence and Humanized Computing, pages 1–10, 2022
[45] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020
6
Methodology
DCL contains 2 unimodal encoders, a multimodal encoder, a report decoder, and 3 dynamic graph modules for construction, encoding and cross-modal
attention, respectively. Report Generation (RG) loss, Image-Report Contrastive (IRC) loss and Image-Report Matching (IRM) loss are adopted to train DCL
7
Methodology
Background
● Notation: the aim is to leverage dynamic graphs to enhance contrastive learning for radiology reporting
● Given a medical image I, denote the target free-text report by T = {y_1, y_2, …, y_n}, where n is the
number of tokens in the report
● The dynamic graph, denoted by G = {V, E} where V and E are the sets of nodes and edges, respectively, is
built on a pre-constructed graph G_pre proposed in [46] and updated with specific knowledge
extracted from the retrieved reports. Each piece of extracted knowledge k is stored in a triplet format,
which consists of a subjective entity e_s, an objective entity e_o and their relation r. n_T and n_K are the
numbers of retrieved reports and triplets, respectively. All the triplets are acquired from RadGraph [17]
● Adopt the Transformer [39] as the backbone to generate long and robust reports
[46] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12910–
12917, 2020
[17] Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. Radgraph: Extracting clinical entities and relations from radiology reports. In
Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems, 2017
8
Methodology
Background
● Image Encoder: only a ViT [10] pretrained on ImageNet [9] is employed as the image encoder, to simplify
the architecture
● The input image is divided into 196 patches, and a [CLS] token is appended to the beginning of the sequence
before it is fed into the encoder layers. An encoder layer f_e(·) can be written as
  f_e(x) = LN(x' + FFN(x')),  x' = LN(x + MHA(x))
where FFN and LN denote the Feed-Forward and Layer Normalization operations, respectively, and x is the input of
each encoder layer. MHA divides a scaled dot-product attention into n parallel heads, and each head Att(·) can
be written as follows:
  Att(Q, K*, V*) = Softmax(Q K*^T / √d) V*
where d = 768 is the dimension of the embedding space and {Q, K*, V*} are the packed d-dimensional Query,
Key, Value vectors, respectively. The final output is the encoded visual vectors f_I, which will be used for report
generation
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for
image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009
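The encoder-layer equations above can be sketched in NumPy. This is an illustrative toy, not the paper's implementation: the Q/K/V projection matrices are omitted and the FFN weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def mha(x, n_heads=12):
    # split the embedding dimension into n parallel heads, attend, re-concatenate
    d_head = x.shape[-1] // n_heads
    heads = [attention(x[:, i*d_head:(i+1)*d_head],
                       x[:, i*d_head:(i+1)*d_head],
                       x[:, i*d_head:(i+1)*d_head]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1)

def encoder_layer(x, w1, w2):
    # residual + LayerNorm around MHA and FFN, as in the equations above
    x = layer_norm(x + mha(x))
    ffn = np.maximum(0.0, x @ w1) @ w2      # two-layer feed-forward with ReLU
    return layer_norm(x + ffn)

# [CLS] + 196 patch tokens, d = 768 as on the slide
d = 768
tokens = rng.standard_normal((197, d)) * 0.02
w1 = rng.standard_normal((d, 4 * d)) * 0.02  # placeholder FFN weights
w2 = rng.standard_normal((4 * d, d)) * 0.02
f_I = encoder_layer(tokens, w1, w2)
```

The output f_I keeps the token layout (197, 768); its first row corresponds to the [CLS] token.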
9
Methodology
Background
● Report Decoder: consists of two Transformer decoder layers
  f_d(y) = LN(y'' + FFN(y'')),  y'' = LN(y' + CA(y', f̂_I)),  y' = LN(y + MMHA(y))
where MMHA and CA represent the masked multi-head self-attention and cross-attention mechanisms, and y is the
input of the decoder. A [Decode] token is added to the beginning of y to signal the start, while an [EOS] token
signals its end
● f_d(y) is sent to a Linear & Log-Softmax layer to obtain the target sentences; the auto-
regressive generation process can be written as
  p(y | I) = ∏_{t=1}^{n} p(y_t | y_{<t}, I)
where y_t is the input token at time step t
● The report generation objective is the cross-entropy loss comparing the predicted token index sequence
with the ground truth; the underlying modules are trained to maximize p(y|I) by minimizing
  L_RG = −∑_{t=1}^{n} log p(y_t | y_{<t}, I)
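The report-generation loss above is a per-token cross-entropy over the decoder's vocabulary scores; a minimal NumPy sketch (the function name and toy logits are illustrative):

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    m = logits.max(axis=-1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))

def report_generation_loss(logits, target_ids):
    """L_RG = -sum_t log p(y_t | y_<t, I).
    logits: (n, vocab) decoder scores per time step; target_ids: (n,) ground-truth token ids."""
    lp = log_softmax(logits)
    return -lp[np.arange(len(target_ids)), target_ids].sum()

# toy example: 3 time steps, vocabulary of 5 tokens
logits = np.zeros((3, 5))
logits[np.arange(3), [1, 2, 3]] = 20.0   # model strongly favours the correct tokens
loss = report_generation_loss(logits, np.array([1, 2, 3]))
```

When the model assigns nearly all probability mass to the ground-truth tokens, the loss approaches zero.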
10
Methodology
Dynamic Graph
Fig 3. Illustration of our proposed dynamic graph construction, dynamic graph encoder (DGE), and graph attention (GA) modules.
The structure of the pre-constructed graph can be found in Fig 1.
11
Methodology
Dynamic Graph
● The chest knowledge graph G_pre proposed in [46] has been widely integrated into CRG systems to
emphasize disease keywords and enhance their relationships
● G_pre consists of 27 entities, a root node referring to the global feature, and an adjacency matrix A =
{e_ij} representing the edges E
● Each node is a disease keyword; e_ij is set to 1 when source node n_i connects to target node n_j
● Nodes linked to the same organ or tissue are connected to each other and to the root
● This graph is not updated during training => limits its effectiveness in 2 aspects
  ○ The entities cannot cover the most common disease keywords for all datasets because of
dataset bias
  ○ Entities linked to different organs can also affect each other clinically
● DCL therefore builds a dynamic graph G with dynamic structure and nodes, and integrates it with visual
features to generate high-quality reports
[46] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12910–
12917, 2020
12
Methodology
Dynamic Graph (Construction)
● First construct the fundamental structure from general knowledge, then add nodes or redefine their
relationships according to specific knowledge
● Extend G_pre to 28 entities, consisting of a global node represented by a [CLS] token, 7 organs or
tissues, and 20 disease keywords; every organ or tissue connects to every other organ or tissue
● To obtain the specific knowledge for a given image I, retrieve the top-n_T similar reports
by calculating the similarity between the visual feature f_I and the representations of reports in a queue,
where n_Q is the length of the report queue
● For each retrieved report, extract anatomy and observation entities with Stanza [47]
● These entities are then used to query the specific knowledge K_I from RadGraph [17]
● Each triplet k in K_I depicts the relationship between a source and a target entity
● 3 kinds of relations: “suggestive of”, “modify” and “located at”
● For a triplet where only the source entity e_s or only the target entity e_o is in the graph, add the other
entity of the triplet as an additional node and set their relation to 1 in A
● The dynamic graph is thus capable of exploiting both general and specific knowledge
[47] Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D Manning, and Curtis P Langlotz. Biomedical and clinical english model packages for the stanza python nlp library. Journal of the American Medical Informatics Association, 28(9):1892–
1899, 2021
[17] Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. Radgraph: Extracting clinical entities and relations from radiology reports. In
Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021
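The two construction steps above (retrieve similar reports, then grow the graph from triplets) can be sketched as follows. The function names, the cosine-similarity choice, and the "add the missing endpoint" rule being applied per triplet are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def retrieve_reports(f_img, queue_feats, n_T=3):
    """Top-n_T report indices by cosine similarity between the visual feature
    f_img and the queued report representations (illustrative retrieval step)."""
    sims = queue_feats @ f_img / (
        np.linalg.norm(queue_feats, axis=1) * np.linalg.norm(f_img) + 1e-8)
    return np.argsort(-sims)[:n_T]

def add_triplets(nodes, adj, triplets):
    """Grow the graph from (e_s, r, e_o) triplets: when exactly one endpoint is
    already a node, add the other as a new node; then mark the edge in A."""
    idx = {n: i for i, n in enumerate(nodes)}
    for e_s, _r, e_o in triplets:
        if (e_s in idx) != (e_o in idx):          # exactly one endpoint present
            new = e_o if e_s in idx else e_s
            nodes.append(new)
            idx[new] = len(nodes) - 1
            adj = np.pad(adj, ((0, 1), (0, 1)))   # extend the adjacency matrix
        if e_s in idx and e_o in idx:
            adj[idx[e_s], idx[e_o]] = 1           # set their relation to 1 in A
    return nodes, adj

# hypothetical entities for illustration only
nodes, adj = ["lung", "effusion"], np.zeros((2, 2), dtype=int)
triplets = [("effusion", "located at", "lung"),
            ("opacity", "suggestive of", "effusion")]
nodes, adj = add_triplets(nodes, adj, triplets)
```

Here "opacity" is absorbed as a new node because only its partner "effusion" was already in the graph, mirroring the rule in the last bullet.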
13
Methodology
Dynamic Graph (Encoder)
● The encoder propagates information and learns dedicated node features; the module is built upon the standard
Transformer encoder layer f_G
  f_G(x) = LN(x' + FFN(x')),  x' = LN(x + RSA(x))
where RSA is an MMHA-like relational self-attention module that encodes the structural information of the graph
into the model
● Utilize A as a visible mask [24, 33] to constrain the original MMHA, ensuring each node can only impact
its linked nodes and enhancing their relationships; f_N is the initialized node representation, consisting
of an entity embedding and a level encoding
● Adopt word embeddings є_sci from the well-trained SciBERT [4] to initialize each entity
● Add a level encoding є_l to indicate whether each node is the root, an organ, or a disease keyword
● The structural information of the graph is thus well represented and encoded during message passing and
propagation
[24] Mingjie Li, Wenjia Cai, Karin Verspoor, Shirui Pan, Xiaodan Liang, and Xiaojun Chang. Cross-modal clinical graph transformer for ophthalmic report generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 20656–20665, 2022
[33] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. K-bert: Enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages
2901–2908, 2020
[4] Iz Beltagy, Kyle Lo, and Arman Cohan. Scibert: A pretrained language model for scientific text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3615–3620, 2019
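The "A as a visible mask" idea above amounts to masking attention scores with the adjacency matrix before the softmax. A single-head NumPy sketch (projection matrices omitted; self-loops are assumed to be present on the diagonal):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_self_attention(f_n, adj):
    """Self-attention over node features f_n where the adjacency matrix A acts
    as a visible mask: node i only attends to nodes j with adj[i, j] == 1."""
    d = f_n.shape[-1]
    scores = f_n @ f_n.T / np.sqrt(d)
    scores = np.where(adj > 0, scores, -1e9)   # hide unlinked nodes
    return softmax(scores) @ f_n

rng = np.random.default_rng(1)
f_n = rng.standard_normal((3, 4))              # 3 nodes, 4-dim embeddings
adj = np.array([[1, 0, 0],                     # node 0 sees only itself
                [1, 1, 0],
                [1, 1, 1]])
out = relational_self_attention(f_n, adj)
```

A node whose row of A contains only the self-loop passes through unchanged, which is exactly the "each node can only impact its linked nodes" constraint.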
14
Methodology
Dynamic Graph (Graph Attention)
● Aims to integrate knowledge from the dynamic graph with visual features
● Following [30], utilize cross attention to achieve this goal
● In each head, the Query comes from the visual features f_I while the Key and Value come from the learned
graph representations f_G
● The output is the dynamic-graph-enhanced visual features f̂_I
● The first token in both f̂_I and f_G is [CLS], which aggregates visual and graph information
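The Query/Key/Value split described above can be sketched as single-head cross attention (projection matrices and multi-head splitting omitted; shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(f_img, f_graph):
    """Cross attention: Query = visual tokens f_I, Key/Value = learned graph
    node features f_G. Returns graph-enhanced visual features shaped like f_img."""
    d = f_img.shape[-1]
    scores = f_img @ f_graph.T / np.sqrt(d)
    return softmax(scores) @ f_graph

rng = np.random.default_rng(2)
f_img = rng.standard_normal((197, 64))    # [CLS] + 196 visual tokens (toy width)
f_graph = rng.standard_normal((28, 64))   # global node + organs + disease keywords
f_hat = graph_attention(f_img, f_graph)
```

Each visual token becomes a convex combination of graph-node features, so the output inherits the visual sequence length while carrying graph knowledge.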
15
Methodology
Contrastive Learning
● Collecting paired image and text data is prohibitively expensive, leading to smaller CRG datasets
compared with captioning datasets
● Introduce the image-report contrastive loss used in DCL, which effectively improves the visual and
textual representations and ensures report-retrieval accuracy in the dynamic graph construction
process
● Inspired by recent vision-language pretraining works [21, 22], also adopt an image-report matching loss
to further enhance the representations and improve performance
[21] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022
[22] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information
processing systems, 34:9694–9705, 2021
16
Methodology
Contrastive Learning (Image-Report Contrastive Loss)
● Activates radiology reporting by encouraging the positive image-report pairs to have similar
representations, in contrast to the negative pairs
● Calculate the similarity between the 2 [CLS] representations by s(I, T) = (W_I f_I^cls)^T (W_T f_T^cls),
where W_I and W_T are 2 learnable matrices
● Maintain 2 queues to store the most recent M image-report representations from the momentum image
and report encoders
● After Softmax activation, the image-report similarity is p_m^{i2r}(I) = exp(s(I, T_m)/τ) / Σ_{m=1}^{M} exp(s(I, T_m)/τ);
the report-image similarity p_m^{r2i}(T) is defined analogously, and τ is a learnable temperature parameter. The IRC
loss can be written as
  L_IRC = (1/2) E[ H(g^{i2r}(I), p^{i2r}(I)) + H(g^{r2i}(T), p^{r2i}(T)) ]
where H is the cross-entropy loss and g(I) is the ground truth of the image-report similarity
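The symmetric IRC loss above reduces to InfoNCE over the queues; a sketch under simplifying assumptions (the W_I / W_T projections are omitted, and the matched pair is placed at index 0 of each queue by convention here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def irc_loss(img_cls, rep_cls, queue_rep, queue_img, tau=0.07):
    """Symmetric image-report contrastive loss (InfoNCE-style sketch):
    cross-entropy of the softmaxed similarities against a one-hot target
    at the positive index (0)."""
    p_i2r = softmax(img_cls @ queue_rep.T / tau)   # image -> queued reports
    p_r2i = softmax(rep_cls @ queue_img.T / tau)   # report -> queued images
    return -0.5 * (np.log(p_i2r[0]) + np.log(p_r2i[0]))

# toy features: the positive pair is identical, negatives point elsewhere
img = np.array([1.0, 0.0])
rep = np.array([1.0, 0.0])
queue_rep = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # index 0 = positive
queue_img = queue_rep.copy()
loss = irc_loss(img, rep, queue_rep, queue_img)
```

With well-separated positives the loss is close to zero; hard negatives (similar but unmatched reports) push it up, which is what drives the representations apart.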
17
Methodology
Contrastive Learning (Image-Report Matching Loss)
● A binary classification task that predicts whether a given image-report pair is positive (matched) or
negative (unmatched)
● Utilize a multimodal encoder to capture multimodal representations via a cross-attention mechanism
● The [Encode] vector is projected to d = 2 with a linear layer to predict the matching probability p^{irm}.
The IRM loss is
  L_IRM = E[ H(g^{irm}, p^{irm}) ]
● Finally, the total training objective is the sum of the 3 losses: L = L_RG + L_IRC + L_IRM
● The multimodal encoder is only used during the training process to improve representation learning
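With the 2-way head described above, the IRM loss is a plain binary cross-entropy, and the total objective is a simple sum. A sketch (function names and the toy logits are illustrative):

```python
import numpy as np

def irm_loss(logits, matched):
    """Binary image-report matching loss. logits: (2,) scores from the linear
    head on the [Encode] vector; matched = 1 for a paired (positive) sample."""
    m = logits.max()
    log_p = logits - m - np.log(np.exp(logits - m).sum())  # log-softmax over 2 classes
    return -log_p[matched]

def total_loss(l_rg, l_irc, l_irm):
    # L = L_RG + L_IRC + L_IRM, as on the slide
    return l_rg + l_irc + l_irm

loss_pos = irm_loss(np.array([0.0, 10.0]), matched=1)  # confident "matched" -> small loss
```

Since the three losses are summed without weights, their relative scales directly determine how strongly each objective shapes the shared encoders.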
18
Datasets
2 datasets
• IU-Xray
  ○ 3,955 radiology reports and 7,470 chest X-ray images
  ○ 2,069/296/590 cases for training/validation/testing
• MIMIC-CXR
  ○ 368,960 chest X-ray images and 222,758 radiology reports
• Natural Language Generation (NLG) Metrics
  ○ Measure the descriptive accuracy of the predicted reports
• Clinical Efficacy Metrics
  ○ Capture and evaluate the clinical correctness of the predicted reports
• Experimental settings
  ○ Use the same image encoder for images of different views and concatenate the visual tokens
19
Results
SOTA comparison
20
Results
Clinical Correctness
21
Results
Ablation study
22
Results
23
Results
24
Conclusion
• Presents a practical approach that leverages a dynamic graph to enhance contrastive learning for radiology
report generation
• The dynamic graph is constructed in a bottom-up manner to integrate retrieved specific knowledge with general
knowledge
• Contrastive learning is employed to improve the visual and textual representations
• The approach outperforms or matches existing SOTA methods on language generation and clinical efficacy
metrics

 
An overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismAn overview of the various scriptures in Hinduism
An overview of the various scriptures in Hinduism
 

● [24] constructed a clinical graph by extracting structural information from the training corpus via an NLP rule-based algorithm, then restored a set of triplets for each case to model domain-specific knowledge and replace the visual representations

[20] Christy Y Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6666–6673, 2019
[26] Mingjie Li, Rui Liu, Fuyu Wang, Xiaojun Chang, and Xiaodan Liang. Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web, pages 1–18, 2022
[48] Haifeng Zhao, Jie Chen, Lili Huang, Tingting Yang, Wanhai Ding, and Chuanfu Li. Automatic generation of medical report with knowledge graph. In Proceedings of the International Conference on Computing and Pattern Recognition, pages 1–1, 2021
[32] Fenglin Liu, Chenyu You, Xian Wu, Shen Ge, Xu Sun, et al. Auto-encoding knowledge graph for unsupervised medical report generation. In Proceedings of the Conference on Neural Information Processing Systems, pages 16266–16279, 2021
[46] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12910–12917, 2020
[30] Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 13753–13762, 2021
[24] Mingjie Li, Wenjia Cai, Karin Verspoor, Shirui Pan, Xiaodan Liang, and Xiaojun Chang. Cross-modal clinical graph transformer for ophthalmic report generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 20656–20665, 2022
5
Related Work
Contrastive Learning (improves representation learning by contrasting positive/negative or similar/dissimilar pairs)
● [43] developed a weakly supervised method to contrast target reports with incorrect ones by identifying “hard” negative samples
● [31] compared the referring image with known normal images via a contrastive attention mechanism to detect the abnormal regions
● [5, 42, 45] employed contrastive learning during the pre-training process to better represent visual features and textual information

[43] An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili, Julian McAuley, and Chun-nan Hsu. Weakly supervised contrastive learning for chest x-ray report generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4009–4015, 2021
[31] Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest x-ray report generation. In Findings of the Association for Computational Linguistics, pages 269–280, 2021
[5] Yu-Jen Chen, Wei-Hsiang Shen, Hao-Wei Chung, Jing-Hao Chiu, Da-Cheng Juan, Tsung-Ying Ho, Chi-Tung Cheng, Meng-Lin Li, and Tsung-Yi Ho. Representative image feature extraction via contrastive learning pre-training for chest x-ray report generation. arXiv preprint arXiv:2209.01604, 2022
[42] Xing Wu, Jingwen Li, Jianjia Wang, and Quan Qian. Multimodal contrastive learning for radiology report generation. Journal of Ambient Intelligence and Humanized Computing, pages 1–10, 2022
[45] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020
6
Methodology
● DCL contains two unimodal encoders, a multimodal encoder, a report decoder, and three dynamic graph modules for construction, encoding, and cross-modal attention, respectively
● Report Generation (RG) loss, Image-Report Contrastive (IRC) loss, and Image-Report Matching (IRM) loss are adopted for training DCL
7
Methodology
Background
● Notation: the aim is to leverage dynamic graphs to enhance contrastive learning for radiology reporting
● Given a medical image I, denote the target free-text report by T = {y1, y2, …, yn}, where n is the number of tokens in the report
● The dynamic graph, denoted by G = {V, E} where V and E are the sets of nodes and edges respectively, is built on a pre-constructed graph Gpre proposed in [46] and updated with specific knowledge K = {k1, …, knK} extracted from the nT retrieved reports. Each k is stored in a triplet format, which consists of a subjective entity es, an objective entity eo, and their relation r. nT and nK are the numbers of retrieved reports and triplets, respectively. All triplets are acquired from RadGraph [17]
● Adopt the Transformer [39] as the backbone to generate long and robust reports

[46] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12910–12917, 2020
[17] Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. Radgraph: Extracting clinical entities and relations from radiology reports. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems, 2017
8
Methodology
Background
● Image Encoder: employ only a ViT [10] pretrained on ImageNet [9] as the image encoder to simplify the architecture
● The input image is divided into 196 patches, and a [CLS] token is appended to the beginning of the sequence before being fed into the encoder layers. An encoder layer fe(.) can be written as:
  x' = LN(x + MHA(x)), fe(x) = LN(x' + FFN(x'))
  where FFN and LN denote the Feed-Forward and Layer Normalization operations, respectively, and x is the input of each encoder layer. MHA divides a scaled dot-product attention into n parallel heads, and each head Att(.) can be written as:
  Att(Q, K*, V*) = Softmax(QK*ᵀ / √d) V*
  where d = 768 is the dimension of the embedding space and {Q, K*, V*} are the packed d-dimensional Query, Key, and Value vectors, respectively. The final output is the encoded visual vectors fI, which will be used for report generation

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009
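The scaled dot-product attention head above can be sketched numerically. Below is a minimal NumPy sketch of a single head; the function names, shapes, and the absence of learned projection matrices are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """One head: Att(Q, K, V) = Softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n_q, n_k) similarity scores
    return softmax(scores) @ V      # convex combination of the Values
```

With an all-zero Query, the scores are uniform and each output row is simply the mean of the Value rows, which is a quick sanity check on the softmax weighting.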
9
Methodology
Background
● Report Decoder: consists of two Transformer decoder layers
  y' = LN(y + MMHA(y)), y'' = LN(y' + CA(y', fI)), fd(y) = LN(y'' + FFN(y''))
  where MMHA and CA represent the masked multi-head self-attention and cross-attention mechanisms, and y is the input of the decoder. A [Decode] token is added to the beginning of y to signal the start, while an [EOS] token signals its end
● fd(y) is sent to a Linear & Log-Softmax layer to obtain the output over target sentences; the auto-regressive generation process can be written as:
  p(y|I) = ∏t p(yt | y1, …, yt−1, I)
  where yt is the input token at time step t
● The report generation objective is the cross-entropy loss comparing the predicted token index sequence with the ground truth; the underlying modules are trained to maximize p(y|I) by minimizing:
  LRG = −∑t log p(yt | y1, …, yt−1, I)
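As a concrete illustration of the RG objective, the following NumPy sketch computes the per-step negative log-likelihood of the ground-truth tokens; the helper name and the assumption that the decoder already produced normalized distributions are hypothetical:

```python
import numpy as np

def report_generation_loss(token_probs, target_ids):
    """L_RG = -sum_t log p(y_t | y_<t, I).

    token_probs: (T, vocab) per-step output distributions (rows sum to 1);
    target_ids: (T,) ground-truth token indices for the report.
    """
    # Pick the probability assigned to the correct token at each step
    step_probs = token_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.sum(np.log(step_probs)))
```

A perfect predictor (probability 1 on every ground-truth token) yields a loss of 0; a uniform two-token vocabulary yields log 2 per step.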
10
Methodology
Dynamic Graph
Fig. 3. Illustration of the proposed dynamic graph construction, dynamic graph encoder (DGE), and graph attention (GA) modules. The structure of the pre-constructed graph can be found in Fig. 1.
11
Methodology
Dynamic Graph
● The chest knowledge graph Gpre proposed in [46] has been widely integrated with CRG systems to emphasize disease keywords and enhance their relationships
● Gpre consists of 27 entities, a root node referring to the global feature, and an adjacency matrix A = {eij} to represent the edges E
● Each node is a disease keyword, and eij is set to 1 when source node ni connects to target node nj
● Nodes linked to the same organ or tissue are connected to each other and to the root
● This graph is not updated during training, which limits its effectiveness in two aspects:
○ The entities cannot cover the most common disease keywords for all datasets because of dataset bias
○ Entities linked to different organs can also affect each other clinically
● Propose a dynamic graph G with dynamic structure and nodes, and integrate it with visual features to generate high-quality reports

[46] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12910–12917, 2020
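The fixed adjacency rule above (keywords of the same organ interconnected, and every keyword linked to the root) can be sketched as follows; the `build_pre_graph` helper and the organ grouping in the test are hypothetical examples, not the paper's exact 27-entity list:

```python
import numpy as np

def build_pre_graph(organ_to_findings):
    """Build a Gpre-style fixed graph from {organ: [disease keywords]}.

    Node 0 is the root (global feature); e_ij = 1 when two keywords
    share an organ, and every keyword is also linked to the root.
    """
    nodes = ["ROOT"] + [f for fs in organ_to_findings.values() for f in fs]
    idx = {n: i for i, n in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)), dtype=int)
    for findings in organ_to_findings.values():
        for a in findings:
            A[idx[a], 0] = A[0, idx[a]] = 1       # keyword <-> root
            for b in findings:
                if a != b:
                    A[idx[a], idx[b]] = 1         # same-organ keywords
    return nodes, A
```

Keywords attached to different organs get no edge, which is exactly the limitation the dynamic graph is designed to relax.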
12
Methodology
Dynamic Graph (Construction)
● First construct the fundamental structure from general knowledge, then add nodes or redefine their relationships according to specific knowledge
● Extend Gpre to 28 entities, consisting of a global node represented by a [CLS] token, 7 organs or tissues, and 20 disease keywords; every organ or tissue connects to each other and to the global node
● To obtain the specific knowledge for a given image I, retrieve the top-nT similar reports by calculating the similarity between the visual feature fI and the representations of reports in a queue, where nQ is the length of the report queue
● For each retrieved report, extract anatomy and observation entities with Stanza [47]
● These entities are further used to quote the specific knowledge KI from RadGraph [17]
● Each triplet k in KI depicts the relationship between a source and a target entity
● There are 3 kinds of relations: “suggestive of”, “modify”, and “located at”
● For a triplet where only the source entity es or the target entity eo is in the graph, add the other entity of the triplet as an additional node and set their relation to 1 in A
● The dynamic graph is thus capable of exploiting both general and specific knowledge

[47] Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D Manning, and Curtis P Langlotz. Biomedical and clinical english model packages for the stanza python nlp library. Journal of the American Medical Informatics Association, 28(9):1892–1899, 2021
[17] Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. Radgraph: Extracting clinical entities and relations from radiology reports. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021
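The triplet rule above (when exactly one endpoint is already in the graph, add the other as a new node and set their relation to 1 in A) can be sketched as a small helper; `extend_graph` and its choice to ignore triplets with two unknown endpoints are assumptions of this sketch:

```python
import numpy as np

def extend_graph(nodes, A, triplets):
    """Grow a graph with (e_s, relation, e_o) triplets.

    nodes: list of entity names; A: (n, n) int adjacency matrix;
    returns the updated node list and adjacency matrix.
    """
    nodes = list(nodes)
    A = A.copy()
    for e_s, _relation, e_o in triplets:
        in_s, in_o = e_s in nodes, e_o in nodes
        if in_s != in_o:                       # exactly one endpoint known
            nodes.append(e_o if in_s else e_s)
            A = np.pad(A, ((0, 1), (0, 1)))    # grow adjacency by one node
        if e_s in nodes and e_o in nodes:
            i, j = nodes.index(e_s), nodes.index(e_o)
            A[i, j] = A[j, i] = 1              # set the relation edge
        # triplets with both endpoints unknown are skipped in this sketch
    return nodes, A
```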
13
Methodology
Dynamic Graph (Encoder)
● The encoder propagates information and learns dedicated node features; the module is built upon the standard Transformer encoder layer fG, where RSA is an MMHA-like relational self-attention module that encodes the structural information of the graph into the model
● A is utilized as a visible mask [24, 33] to control the original MMHA, ensuring each node can only impact its linked nodes and enhancing their relationships; fN is the initialized node representation and consists of entity embedding and level encoding
● Word embeddings єsci from the well-trained SciBERT [4] are adopted to initialize each entity
● A level encoding єl is added to indicate whether each node is the root, an organ, or a disease keyword
● The structural information of the graph is thus well represented and encoded during message passing and propagation

[24] Mingjie Li, Wenjia Cai, Karin Verspoor, Shirui Pan, Xiaodan Liang, and Xiaojun Chang. Cross-modal clinical graph transformer for ophthalmic report generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 20656–20665, 2022
[33] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. K-bert: Enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 2901–2908, 2020
[4] Iz Beltagy, Kyle Lo, and Arman Cohan. Scibert: A pretrained language model for scientific text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3615–3620, 2019
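Using A as a visible mask over the self-attention scores can be sketched as below. This is a single-head, unparameterized stand-in for the RSA module (no learned Query/Key/Value projections), so the function name and simplifications are assumptions:

```python
import numpy as np

def masked_self_attention(X, A):
    """Self-attention where node i may only attend to its linked nodes.

    X: (n, d) node features; A: (n, n) adjacency used as a visible mask
    (each node additionally attends to itself).
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    mask = (A + np.eye(len(A))) > 0            # linked nodes + self
    scores = np.where(mask, scores, -1e9)      # hide unlinked nodes
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ X
```

With an all-zero adjacency every node attends only to itself, so the features pass through unchanged; adding edges lets linked nodes mix.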
14
Methodology
Dynamic Graph (Graph Attention)
● Aims to integrate knowledge from the dynamic graph with visual features
● Following [30], cross attention is utilized to achieve this goal
● In each head, the Query comes from the visual features fI while the Key and Value come from the learned graph representations fG
● The output is the dynamic-graph-enhanced visual features f̂I
● The first token in both f̂I and fG is [CLS], which aggregates the visual and graph information
15
Methodology
Contrastive Learning
● Collecting paired image and text data is prohibitively expensive, leading to smaller CRG datasets compared with captioning datasets
● Introduce the image-report contrastive loss used in DCL, which effectively improves the visual and textual representations and ensures report retrieval accuracy in the dynamic graph construction process
● Inspired by recent vision-language pretraining works [21, 22], an image-report matching loss is adopted to further enhance the representations and improve performance

[21] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022
[22] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021
16
Methodology
Contrastive Learning (Image-Report Contrastive Loss)
● Activates radiology reporting by encouraging positive image-report pairs to have similar representations, in contrast to the negative pairs
● Calculate the similarity between the two [CLS] representations by s(I, T) = (WI fI)ᵀ(WT fT), where WI and WT are two learnable matrices
● Maintain two queues to store the most recent M image-report representations from the momentum image and report encoders
● After Softmax activation with a learnable temperature parameter τ, the image-report similarity pi2r(I) and the report-image similarity pr2i(T) are obtained. The IRC loss can be written as:
  LIRC = ½ E[ H(g(I), pi2r(I)) + H(g(T), pr2i(T)) ]
  where H is the cross-entropy loss and g(I) is the ground truth of image-report similarity
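A simplified in-batch version of the IRC loss might look like the sketch below. It drops the momentum encoders and queues described above and uses only in-batch negatives, so `irc_loss`, the batch-only setting, and the fixed default temperature are assumptions of this sketch:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def irc_loss(img_cls, rep_cls, tau=0.07):
    """Symmetric image-report contrastive loss over a batch.

    img_cls, rep_cls: (B, d) [CLS] features; pair i is the positive for
    row i in both the image-to-report and report-to-image directions.
    """
    img = img_cls / np.linalg.norm(img_cls, axis=1, keepdims=True)
    rep = rep_cls / np.linalg.norm(rep_cls, axis=1, keepdims=True)
    sims = img @ rep.T / tau                  # (B, B) scaled similarities
    p_i2r = softmax(sims)                     # image -> report
    p_r2i = softmax(sims.T)                   # report -> image
    idx = np.arange(len(img))
    ce = -np.log(p_i2r[idx, idx]) - np.log(p_r2i[idx, idx])
    return float(ce.mean() / 2.0)
```

When matched pairs already have the highest similarity the loss is near zero; shuffling the pairing drives it up, which is the gradient signal that pulls paired representations together.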
17
Methodology
Contrastive Learning (Image-Report Matching Loss)
● A binary classification task to predict whether a given image-report pair is positive (matched) or negative (unmatched)
● A multimodal encoder captures the multimodal representations via a cross attention mechanism
● The [Encode] vector is projected to d = 2 with a linear layer to predict the matching probability pirm. The IRM loss is conducted as:
  LIRM = E[ H(g(I, T), pirm(I, T)) ]
  where g(I, T) is the ground-truth matching label
● Finally, the total loss is calculated as the sum of the three losses: L = LRG + LIRC + LIRM
● The multimodal encoder is only used during the training process to improve representation learning
18
Datasets
● Two datasets:
○ IU-Xray: 3,955 radiology reports and 7,470 chest X-ray images; 2,069/296/590 cases for training/validation/testing
○ MIMIC-CXR: 368,960 chest X-ray images and 222,758 radiology reports
● Natural Language Generation (NLG) metrics
○ Used to measure the descriptive accuracy of predicted reports
● Clinical Efficacy metrics
○ Used to capture and evaluate the clinical correctness of predicted reports
● Experimental settings
○ Use the same image encoder for images of different views and concatenate their visual tokens
24
Conclusion
● Present a practical approach that leverages a dynamic graph to enhance contrastive learning for radiology report generation
● The dynamic graph is constructed in a bottom-up manner to integrate retrieved specific knowledge with general knowledge
● Contrastive learning is employed to improve visual and textual representations
● The approach outperforms or matches existing SOTA methods on language generation and clinical efficacy metrics