Slides presented at the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, Macao, SAR. https://doi.org/10.24963/ijcai.2023/554
1. SemiGNN-PPI: Self-Ensembling Multi-Graph Neural Network for Efficient and Generalizable Protein-Protein Interaction Prediction
Ziyuan Zhao1,2,*, Peisheng Qian1,*, Xulei Yang1,+, Zeng Zeng3, Cuntai Guan2, Wai Leong Tam4, Xiaoli Li1,2
Presenter: Ziyuan Zhao, Peisheng Qian
1 Institute for Infocomm Research (I2R), A*STAR, Singapore
2 School of Computer Science and Engineering (SCSE), Nanyang Technological University, Singapore
3 School of Microelectronics, Shanghai University, China
4 Genome Institute of Singapore (GIS), A*STAR, Singapore
Paper ID: 2877
2. Challenges in Protein-Protein Interaction Prediction
- Protein-protein interactions (PPIs) are central to various cellular functions and processes.
- Label Scarcity: PPIs need to be annotated experimentally, and labels may not be available.
- Domain Shift: models trained on one domain can suffer tremendous performance degradation when evaluated on another domain.
3. Improving Efficiency and Generalization in PPI Prediction
- Machine learning (ML) based, deep learning (DL) based, and graph neural network (GNN) based methods have been investigated.
- However, dealing with imperfect data to improve model efficiency and generalization in PPI prediction remains underexplored.
5. Multi-Graph Encoding
- PPI graph: proteins and PPIs.
- Label graph: PPI types and their correlations.
- Protein-Graph Encoding (PGE) aggregates representations from neighboring proteins.
- Label-Graph Encoding (LGE) learns inter-dependent classifiers.
- The multi-graph based classifier applies the classifiers learned by LGE to the representations from PGE to obtain the PPI prediction scores.
6. Self-Ensemble Graph Learning
- We adopt mean teaching with graph data augmentation.
- Edge Manipulation (EM): to handle connectivity variations, we randomly replace a certain percentage of edges.
- Node Manipulation (NM): to handle missing attributes, we randomly remove node features with zero masking.
- We construct two augmented graph views for the student and teacher networks and encourage consistent predictions.
7. Graph Consistency Constraints
- We model the fine-grained structural protein-protein relations in the feature embedding space [Ma et al., 2022].
- Edge matching:
  - Student embedding graph
  - Teacher embedding graph
  - Consistent instance-wise correlations
  - Edge matching loss
Yuchen Ma, Yanbei Chen, and Zeynep Akata. Distilling knowledge from self-supervised teacher by embedding graph alignment. In 33rd British Machine Vision Conference, BMVA Press, 2022.
[Figure: student and teacher protein encoding branches]
8. Graph Consistency Constraints
- Node matching:
  - Edge embedding graph
  - Aligning encodings of the same protein
  - Node matching loss
- Overall loss function
9. Datasets and Settings
- 3 datasets: STRING, SHS148k, and SHS27k.
- 7 PPI types: activation, binding, catalysis, expression, inhibition, post-translational modification (ptmod), and reaction.
- Random, breadth-first search (BFS), and depth-first search (DFS) partitions [Lv et al., 2021].
- Evaluation metric: F1.
Guofeng Lv, Zhiqiang Hu, Yanguang Bi, and Shaoting Zhang. Learning unknown from correlations: Graph neural network for inter-novel-protein interaction prediction. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 3677-3683, 2021.
10. Comparison with Baselines
- Comparing with machine learning (ML) and deep learning (DL) approaches, and GNN-PPI (a strong graph-based baseline).
- We outperform baselines by a clear margin in all datasets and partition schemes.
11. Experiments under Label Scarcity
- We use 5%, 10%, and 20% of the labels in the train set.
- Our method achieves better performance under all scenarios with different datasets, label ratios, and partition schemes.
12. Experiments under Domain Shift
- The model is trained and tested in 3 settings:
  - Domain Generalization (DG)
  - Inductive Domain Adaptation (IDA)
  - Transductive Domain Adaptation (TDA)
13. Inter-novel-protein Interaction Prediction
- In the labeled train set:
  - BS subset (both proteins of the PPI are present).
  - ES subset (either one protein of the PPI is present).
  - NS subset (neither of the proteins is present).
14. Conclusions
- We identified 2 challenges in PPI prediction: label scarcity and domain shift. We addressed them with a novel SemiGNN-PPI for efficient and generalizable multi-type PPI prediction.
- To enhance generalization capability, we constructed and processed graphs at the protein and label levels.
- To leverage unlabeled PPI data, we integrated GNN into Mean Teacher and designed multiple graph consistency constraints.
- Experimental results validated the effectiveness of SemiGNN-PPI.
15. Acknowledgement
This research was funded by Competitive Research Programme "NRF-CRP22-2019-0003", National Research Foundation Singapore, and partially supported by A*STAR core funding.
Good afternoon session chairs and everyone here today. My name is Peisheng. Today I will present our paper SemiGNN-PPI: Self-Ensembling Multi-Graph Neural Network for Efficient and Generalizable Protein-Protein Interaction Prediction.
Protein-protein interactions are central to many cellular functions and processes. PPI prediction is a classification problem in which we predict the classes of the interactions between two proteins. This is important because PPIs have significant implications for drug development and disease diagnosis. However, in real-world scenarios, PPI prediction is affected by various factors, such as label scarcity and domain shift.
In the label scarcity scenario, PPIs need to be annotated from experiments, and only a small portion of them can be used for training. The lack of labels can be a significant bottleneck for PPI prediction.
In the domain shift scenario, most existing methods are only developed and validated using in-distribution data, and they suffer severe performance degradation on unseen data with different distributions.
Therefore, label scarcity and domain shift are 2 challenges in PPI prediction.
To deal with label scarcity, we aim at improving the data efficiency. To alleviate domain shift, we aim at enhancing the generalization capability for PPI prediction.
Computational approaches for PPI prediction include machine learning based and deep learning based methods. Since PPI can naturally be formulated as graphs with proteins as nodes and interactions as edges, graph neural networks have also been investigated. However, dealing with imperfect data for improving model efficiency and generalization remains a vital but underexplored issue.
In our approach, for generalizable PPI prediction, we use multi-graph encoding to model protein correlations and label dependencies, in which we construct graphs to learn correlations between proteins and label dependencies simultaneously. For data-efficient PPI prediction, we advance GNN with Mean Teacher to explore unlabeled data through self-ensemble graph learning. Moreover, we apply multiple graph consistency constraints for regularization in self-ensemble learning.
We will go through the 3 points in the following slides.
The proposed multi-graph encoding is based on two graphs. In the PPI graph, the proteins are nodes and PPIs are edges. In the label graph, the PPI types are nodes and correlations between the PPI types are edges.
To obtain protein graph encodings, we use graph neural networks to aggregate representations from neighboring proteins in the PPI graph.
To learn correlations among different types of interactions, we learn inter-dependent classifiers using a Graph Convolutional Network.
Then, we apply the learned classifiers from label graph encoding to the learned representations from protein graph encoding and obtain the predicted classification scores.
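As a rough illustration of this pipeline, the sketch below combines a one-step protein-graph encoder (PGE) with classifiers produced from a label graph (LGE) in plain PyTorch. The layer sizes, the normalization of the adjacency matrices, and the element-wise product used to build pair representations are our assumptions for the sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiGraphClassifier(nn.Module):
    """Minimal sketch of multi-graph encoding: PGE over the PPI graph, LGE over the label graph."""

    def __init__(self, in_dim, hid_dim, num_classes, label_feat_dim):
        super().__init__()
        # Protein-Graph Encoding (PGE): one graph-convolution step (A_hat @ X @ W).
        self.pge = nn.Linear(in_dim, hid_dim)
        # Label-Graph Encoding (LGE): maps label-node features to per-class classifiers.
        self.lge = nn.Linear(label_feat_dim, hid_dim)

    def forward(self, x, adj_ppi, label_feat, adj_label, pairs):
        # PGE: aggregate representations from neighboring proteins.
        h = torch.relu(self.pge(adj_ppi @ x))          # (num_proteins, hid_dim)
        # LGE: propagate over the label-correlation graph to get inter-dependent classifiers.
        w = self.lge(adj_label @ label_feat)            # (num_classes, hid_dim)
        # Represent each PPI by combining its two protein embeddings
        # (the element-wise product is an assumption; the paper may use another operator).
        e = h[pairs[:, 0]] * h[pairs[:, 1]]             # (num_pairs, hid_dim)
        # Apply the learned classifiers to the pair representations.
        return e @ w.t()                                # (num_pairs, num_classes) logits
```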
Next, to leverage unlabeled data, we adopt the mean teaching architecture. To facilitate self-ensemble graph learning, we use two data augmentation methods at both the edge and node level.
Edge manipulation aims to improve robustness against connectivity variations: we randomly replace a certain percentage of edges in the input to the models, since some PPIs could be unidentified or wrongly identified.
Node manipulation aims to improve robustness against missing attributes: we randomly mask node features with zeros and feed them into the models, expecting the model to learn effective features even when attribute information is missing.
We use edge and node manipulations to construct two augmented graph views to feed the student and teacher networks separately, and encourage them to generate consistent predictions.
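The two augmentations can be sketched as follows. The replacement scheme for EM (uniform random new endpoints) and the masking ratio for NM are assumptions, since the talk only states that edges are randomly replaced and node features are zero-masked.

```python
import torch


def edge_manipulation(edge_index, num_nodes, ratio=0.1):
    """Randomly replace a fraction of edges with new random edges (EM).

    `edge_index` is a (2, num_edges) tensor; the replacement scheme is an assumption.
    """
    num_edges = edge_index.size(1)
    num_replace = int(ratio * num_edges)
    idx = torch.randperm(num_edges)[:num_replace]
    new_edges = torch.randint(0, num_nodes, (2, num_replace))
    edge_index = edge_index.clone()
    edge_index[:, idx] = new_edges
    return edge_index


def node_manipulation(x, ratio=0.1):
    """Zero-mask the features of a random subset of nodes (NM)."""
    x = x.clone()
    mask = torch.rand(x.size(0)) < ratio
    x[mask] = 0.0
    return x


# Two augmented views, one fed to the student and one to the teacher
# (x: node features, edge_index: PPI edges; variable names are ours):
# student_view = (node_manipulation(x), edge_manipulation(edge_index, x.size(0)))
# teacher_view = (node_manipulation(x), edge_manipulation(edge_index, x.size(0)))
```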
Aside from consistency in the prediction space, we also want to model the fine-grained structural protein-protein relations in the feature embedding space.
To achieve this, we use edge matching and node matching.
For edge matching, we calculate all pairwise Pearson's correlation coefficients between node embeddings in the same batch from the student network and call the result the student embedding graph. Similarly, we construct the teacher embedding graph. Then we enforce consistent instance-wise correlations using the edge matching loss. In the loss function, Gse refers to the student embedding graph, Gte refers to the teacher embedding graph, and Adj refers to the adjacency matrix.
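Based on this spoken description alone (the exact formulation in the paper may differ), the edge matching loss could be written as:

```latex
% G^{se}: student embedding graph, G^{te}: teacher embedding graph,
% Adj(.): adjacency matrix of pairwise Pearson correlations within a batch.
\mathcal{L}_{em} = \left\lVert \mathrm{Adj}\!\left(\mathcal{G}^{se}\right) - \mathrm{Adj}\!\left(\mathcal{G}^{te}\right) \right\rVert_F^2
```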
We also use node matching as another constraint.
For node matching, we formulate the edge embedding graph by calculating all pairwise Pearson's correlation coefficients between the student encodings and the teacher encodings in the same batch. We align the encodings of the same protein from the two networks with a node matching loss. In the loss function, Gste is the edge embedding graph, I is the identity matrix, and diag is the operation that keeps only the diagonal values of the matrix and sets the rest to 0.
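Again reconstructing from the spoken description only, the node matching loss could take a form like:

```latex
% G^{ste}: edge embedding graph of student-teacher Pearson correlations,
% diag(.): keep only diagonal entries, I: identity matrix.
\mathcal{L}_{nm} = \left\lVert \mathrm{diag}\!\left(\mathrm{Adj}\!\left(\mathcal{G}^{ste}\right)\right) - I \right\rVert_F^2
```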
The overall loss is a weighted sum of the supervised loss, the consistency loss, and the node and edge matching losses.
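Putting the pieces together, the weighted sum described here can be written as below; the weights are placeholders, and the paper's symbols and values may differ.

```latex
\mathcal{L} = \mathcal{L}_{sup} + \lambda_1 \mathcal{L}_{cons} + \lambda_2 \mathcal{L}_{em} + \lambda_3 \mathcal{L}_{nm}
```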
Experiments were conducted on 3 datasets, STRING, SHS148k, and SHS27k. The PPIs are annotated with 7 types. Each PPI is labeled with at least one of them.
We follow existing partition algorithms and use random, breadth-first search, and depth-first search over protein nodes to create test sets with 20% of the data. The rest of the data is used as the train set. BFS and DFS create test data with more unseen proteins, which are more challenging scenarios.
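For intuition, a BFS-based split over protein nodes might look like the sketch below. The seed handling, stopping rule, and how PPIs are assigned to the test set follow our reading of Lv et al. [2021] and are not taken from their code.

```python
from collections import deque
import random


def bfs_partition(adjacency, test_fraction=0.2, seed=0):
    """Grow a BFS tree over protein nodes until roughly `test_fraction` of them
    is covered; PPIs among the selected proteins then form the test set.

    `adjacency` maps each protein to the set of its interacting proteins.
    This is only a sketch of the idea behind the BFS partition; details are assumptions.
    """
    random.seed(seed)
    proteins = list(adjacency)
    target = int(test_fraction * len(proteins))
    start = random.choice(proteins)
    selected, queue = {start}, deque([start])
    while queue and len(selected) < target:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in selected:
                selected.add(nb)
                queue.append(nb)
                if len(selected) >= target:
                    break
    return selected  # test proteins; the remaining proteins/PPIs form the train set
```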
We use F1 score as the evaluation metric.
We compare with Machine Learning and Deep Learning baselines. Particularly, GNN-PPI is a strong graph-based baseline for multi-class PPI prediction.
In this table, our method outperforms baselines by a clear margin in all datasets and all partition schemes.
Next, we use 5%, 10%, and 20% of the labels in the train set to simulate the label scarcity scenario.
In this case, GNN-PPI suffers severe performance degradation with fewer labels. In comparison, our method achieves better performance under all scenarios with different datasets, label ratios, and partition schemes.
To assess the generalization capability of the proposed method, we test the model in 3 evaluation settings:
For Domain Generalization (DG), the model does not have access to the trainset-heterologous dataset and is tested on the unseen dataset.
For Inductive Domain Adaptation (IDA), the model has access to unlabeled training data in the trainset-heterologous dataset.
For Transductive Domain Adaptation (TDA), the model has access to the whole unlabeled trainset-heterologous dataset.
Our method outperforms GNN-PPI in all of the 3 settings.
Next, we analyze model performance on inter-novel-protein interaction, where the proteins could be present or absent in the labeled trainset. The ES and NS subsets are more challenging because the PPI to predict is between proteins that the model did not see during training.
In this table, our method outperforms GNN-PPI in most subsets.
In conclusion, we identified two challenges in PPI prediction: label scarcity and domain shift. We addressed them with a novel self-ensembling multi-graph neural network for efficient and generalizable multi-type PPI prediction.
To enhance generalization capability, we constructed and processed graphs at the protein and label levels. To leverage unlabeled PPI data, we integrated GNN into Mean Teacher and used multiple graph consistency constraints to align feature embeddings. Finally, experimental results demonstrated the effectiveness of our approach.
The research was funded by the National Research Foundation Singapore and partially supported by A*STAR core funding. We would also like to thank all collaborators from NTU, Shanghai University, and the Genome Institute of Singapore.